Commit bf7914e
Parent(s): 2892945

sync before github rebase
Files changed:
- .agents/skills/frontend-design/LICENSE.txt +177 -0
- .agents/skills/frontend-design/SKILL.md +42 -0
- .agents/skills/hf-cli/SKILL.md +188 -0
- .agents/skills/reinforcement-learning/SKILL.md +20 -0
- .agents/skills/reinforcement-learning/references/patterns.md +183 -0
- .agents/skills/reinforcement-learning/references/sharp_edges.md +187 -0
- .agents/skills/reinforcement-learning/references/validations.md +129 -0
- .gitignore +1 -0
- skills-lock.json +20 -0
.agents/skills/frontend-design/LICENSE.txt
ADDED
@@ -0,0 +1,177 @@

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS
.agents/skills/frontend-design/SKILL.md
ADDED
@@ -0,0 +1,42 @@
---
name: frontend-design
description: Create distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, artifacts, posters, or applications (examples include websites, landing pages, dashboards, React components, HTML/CSS layouts, or when styling/beautifying any web UI). Generates creative, polished code and UI design that avoids generic AI aesthetics.
license: Complete terms in LICENSE.txt
---

This skill guides creation of distinctive, production-grade frontend interfaces that avoid generic "AI slop" aesthetics. Implement real working code with exceptional attention to aesthetic details and creative choices.

The user provides frontend requirements: a component, page, application, or interface to build. They may include context about the purpose, audience, or technical constraints.

## Design Thinking

Before coding, understand the context and commit to a BOLD aesthetic direction:
- **Purpose**: What problem does this interface solve? Who uses it?
- **Tone**: Pick an extreme: brutally minimal, maximalist chaos, retro-futuristic, organic/natural, luxury/refined, playful/toy-like, editorial/magazine, brutalist/raw, art deco/geometric, soft/pastel, industrial/utilitarian, etc. There are so many flavors to choose from. Use these for inspiration but design one that is true to the aesthetic direction.
- **Constraints**: Technical requirements (framework, performance, accessibility).
- **Differentiation**: What makes this UNFORGETTABLE? What's the one thing someone will remember?

**CRITICAL**: Choose a clear conceptual direction and execute it with precision. Bold maximalism and refined minimalism both work - the key is intentionality, not intensity.

Then implement working code (HTML/CSS/JS, React, Vue, etc.) that is:
- Production-grade and functional
- Visually striking and memorable
- Cohesive with a clear aesthetic point-of-view
- Meticulously refined in every detail

## Frontend Aesthetics Guidelines

Focus on:
- **Typography**: Choose fonts that are beautiful, unique, and interesting. Avoid generic fonts like Arial and Inter; opt instead for distinctive choices that elevate the frontend's aesthetics; unexpected, characterful font choices. Pair a distinctive display font with a refined body font.
- **Color & Theme**: Commit to a cohesive aesthetic. Use CSS variables for consistency. Dominant colors with sharp accents outperform timid, evenly-distributed palettes.
- **Motion**: Use animations for effects and micro-interactions. Prioritize CSS-only solutions for HTML. Use Motion library for React when available. Focus on high-impact moments: one well-orchestrated page load with staggered reveals (animation-delay) creates more delight than scattered micro-interactions. Use scroll-triggering and hover states that surprise.
- **Spatial Composition**: Unexpected layouts. Asymmetry. Overlap. Diagonal flow. Grid-breaking elements. Generous negative space OR controlled density.
- **Backgrounds & Visual Details**: Create atmosphere and depth rather than defaulting to solid colors. Add contextual effects and textures that match the overall aesthetic. Apply creative forms like gradient meshes, noise textures, geometric patterns, layered transparencies, dramatic shadows, decorative borders, custom cursors, and grain overlays.

NEVER use generic AI-generated aesthetics like overused font families (Inter, Roboto, Arial, system fonts), cliched color schemes (particularly purple gradients on white backgrounds), predictable layouts and component patterns, and cookie-cutter design that lacks context-specific character.

Interpret creatively and make unexpected choices that feel genuinely designed for the context. No design should be the same. Vary between light and dark themes, different fonts, different aesthetics. NEVER converge on common choices (Space Grotesk, for example) across generations.

**IMPORTANT**: Match implementation complexity to the aesthetic vision. Maximalist designs need elaborate code with extensive animations and effects. Minimalist or refined designs need restraint, precision, and careful attention to spacing, typography, and subtle details. Elegance comes from executing the vision well.

Remember: Claude is capable of extraordinary creative work. Don't hold back, show what can truly be created when thinking outside the box and committing fully to a distinctive vision.
.agents/skills/hf-cli/SKILL.md
ADDED
@@ -0,0 +1,188 @@
---
name: hf-cli
description: "Hugging Face Hub CLI (`hf`) for downloading, uploading, and managing repositories, models, datasets, and Spaces on the Hugging Face Hub. Replaces the now-deprecated `huggingface-cli` command."
---

Install: `curl -LsSf https://hf.co/cli/install.sh | bash -s`.

The Hugging Face Hub CLI tool `hf` is available. IMPORTANT: The `hf` command replaces the deprecated `huggingface-cli` command.

Use `hf --help` to view available functions. Note that auth commands are now all under `hf auth`, e.g. `hf auth whoami`.

Generated with `huggingface_hub v1.8.0`. Run `hf skills add --force` to regenerate.

## Commands

- `hf download REPO_ID` — Download files from the Hub. `[--type CHOICE --revision TEXT --include TEXT --exclude TEXT --cache-dir TEXT --local-dir TEXT --force-download --dry-run --quiet --max-workers INTEGER]`
- `hf env` — Print information about the environment.
- `hf sync` — Sync files between a local directory and a bucket. `[--delete --ignore-times --ignore-sizes --plan TEXT --apply TEXT --dry-run --include TEXT --exclude TEXT --filter-from TEXT --existing --ignore-existing --verbose --quiet]`
- `hf upload REPO_ID` — Upload a file or a folder to the Hub. Recommended for single-commit uploads. `[--type CHOICE --revision TEXT --private --include TEXT --exclude TEXT --delete TEXT --commit-message TEXT --commit-description TEXT --create-pr --every FLOAT --quiet]`
- `hf upload-large-folder REPO_ID LOCAL_PATH` — Upload a large folder to the Hub. Recommended for resumable uploads. `[--type CHOICE --revision TEXT --private --include TEXT --exclude TEXT --num-workers INTEGER --no-report --no-bars]`
- `hf version` — Print information about the hf version.
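As a sketch of the download and upload commands listed above (the repo id and local path are hypothetical placeholders), the following builds typical invocations using only the documented flags. The commands are echoed rather than executed so the sketch works without the CLI installed; drop the `echo` lines and run the arrays directly to perform real transfers.

```shell
#!/usr/bin/env bash
# Sketch of typical `hf download` / `hf upload` invocations.
# The repo id and local directory are hypothetical placeholders; the
# commands are only printed, so nothing is fetched or uploaded.
set -euo pipefail

repo="my-org/my-model"     # placeholder repo id
local_dir="./my-model"

# Fetch only weights and config files into a plain local directory.
download_cmd=(hf download "$repo" --include "*.safetensors" "*.json" --local-dir "$local_dir")

# Push the current folder back to the Hub as a single commit.
upload_cmd=(hf upload "$repo" --commit-message "update weights")

echo "${download_cmd[*]}"
echo "${upload_cmd[*]}"
```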
### `hf auth` — Manage authentication (login, logout, etc.).

- `hf auth list` — List all stored access tokens.
- `hf auth login` — Login using a token from huggingface.co/settings/tokens. `[--add-to-git-credential --force]`
- `hf auth logout` — Logout from a specific token. `[--token-name TEXT]`
- `hf auth switch` — Switch between access tokens. `[--token-name TEXT --add-to-git-credential]`
- `hf auth whoami` — Find out which huggingface.co account you are logged in as. `[--format CHOICE]`

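A minimal sketch of the auth flow above. The commands are echoed so the sketch runs without the CLI or a token; the `json` value for `--format` is an assumption, since the listing only says `CHOICE`.

```shell
#!/usr/bin/env bash
# Sketch: log in, then confirm which account the token belongs to.
# Commands are printed, not run; `--format json` is an assumed value.
set -euo pipefail

login_cmd=(hf auth login)
whoami_cmd=(hf auth whoami --format json)

echo "${login_cmd[*]}"
echo "${whoami_cmd[*]}"
```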
### `hf buckets` — Commands to interact with buckets.

- `hf buckets cp SRC` — Copy a single file to or from a bucket. `[--quiet]`
- `hf buckets create BUCKET_ID` — Create a new bucket. `[--private --exist-ok --quiet]`
- `hf buckets delete BUCKET_ID` — Delete a bucket. `[--yes --missing-ok --quiet]`
- `hf buckets info BUCKET_ID` — Get info about a bucket. `[--quiet]`
- `hf buckets list` — List buckets or files in a bucket. `[--human-readable --tree --recursive --format CHOICE --quiet]`
- `hf buckets move FROM_ID TO_ID` — Move (rename) a bucket to a new name or namespace.
- `hf buckets remove ARGUMENT` — Remove files from a bucket. `[--recursive --yes --dry-run --include TEXT --exclude TEXT --quiet]`
- `hf buckets sync` — Sync files between a local directory and a bucket. `[--delete --ignore-times --ignore-sizes --plan TEXT --apply TEXT --dry-run --include TEXT --exclude TEXT --filter-from TEXT --existing --ignore-existing --verbose --quiet]`

### `hf cache` — Manage local cache directory.

- `hf cache list` — List cached repositories or revisions. `[--cache-dir TEXT --revisions --filter TEXT --format CHOICE --quiet --sort CHOICE --limit INTEGER]`
- `hf cache prune` — Remove detached revisions from the cache. `[--cache-dir TEXT --yes --dry-run]`
- `hf cache rm TARGETS` — Remove cached repositories or revisions. `[--cache-dir TEXT --yes --dry-run]`
- `hf cache verify REPO_ID` — Verify checksums for a single repo revision from cache or a local directory. `[--type CHOICE --revision TEXT --cache-dir TEXT --local-dir TEXT --fail-on-missing-files --fail-on-extra-files]`

### `hf collections` — Interact with collections on the Hub.

- `hf collections add-item COLLECTION_SLUG ITEM_ID ITEM_TYPE` — Add an item to a collection. `[--note TEXT --exists-ok]`
- `hf collections create TITLE` — Create a new collection on the Hub. `[--namespace TEXT --description TEXT --private --exists-ok]`
- `hf collections delete COLLECTION_SLUG` — Delete a collection from the Hub. `[--missing-ok]`
- `hf collections delete-item COLLECTION_SLUG ITEM_OBJECT_ID` — Delete an item from a collection. `[--missing-ok]`
- `hf collections info COLLECTION_SLUG` — Get info about a collection on the Hub. Output is in JSON format.
- `hf collections list` — List collections on the Hub. `[--owner TEXT --item TEXT --sort CHOICE --limit INTEGER --format CHOICE --quiet]`
- `hf collections update COLLECTION_SLUG` — Update a collection's metadata on the Hub. `[--title TEXT --description TEXT --position INTEGER --private --theme TEXT]`
- `hf collections update-item COLLECTION_SLUG ITEM_OBJECT_ID` — Update an item in a collection. `[--note TEXT --position INTEGER]`

### `hf datasets` — Interact with datasets on the Hub.

- `hf datasets info DATASET_ID` — Get info about a dataset on the Hub. Output is in JSON format. `[--revision TEXT --expand TEXT]`
- `hf datasets list` — List datasets on the Hub. `[--search TEXT --author TEXT --filter TEXT --sort CHOICE --limit INTEGER --expand TEXT --format CHOICE --quiet]`
- `hf datasets parquet DATASET_ID` — List parquet file URLs available for a dataset. `[--subset TEXT --split TEXT --format CHOICE --quiet]`
- `hf datasets sql SQL` — Execute a raw SQL query with DuckDB against dataset parquet URLs. `[--format CHOICE]`

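A sketch of dataset inspection using the commands above. The dataset id and the table name in the SQL query are hypothetical placeholders (the listing does not specify the query's table syntax), and the commands are echoed so the sketch runs offline.

```shell
#!/usr/bin/env bash
# Sketch: fetch dataset metadata, then count rows with DuckDB SQL.
# Dataset id and table name are placeholders; commands are printed only.
set -euo pipefail

ds="my-org/my-dataset"   # placeholder dataset id
info_cmd=(hf datasets info "$ds")
sql_cmd=(hf datasets sql "SELECT COUNT(*) FROM train")   # table name is assumed

echo "${info_cmd[*]}"
echo "${sql_cmd[*]}"
```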
### `hf discussions` — Manage discussions and pull requests on the Hub.

- `hf discussions close REPO_ID NUM` — Close a discussion or pull request. `[--comment TEXT --yes --type CHOICE]`
- `hf discussions comment REPO_ID NUM` — Comment on a discussion or pull request. `[--body TEXT --body-file PATH --type CHOICE]`
- `hf discussions create REPO_ID --title TEXT` — Create a new discussion or pull request on a repo. `[--body TEXT --body-file PATH --pull-request --type CHOICE]`
- `hf discussions diff REPO_ID NUM` — Show the diff of a pull request. `[--type CHOICE]`
- `hf discussions info REPO_ID NUM` — Get info about a discussion or pull request. `[--comments --diff --no-color --type CHOICE --format CHOICE]`
- `hf discussions list REPO_ID` — List discussions and pull requests on a repo. `[--status CHOICE --kind CHOICE --author TEXT --limit INTEGER --type CHOICE --format CHOICE --quiet]`
- `hf discussions merge REPO_ID NUM` — Merge a pull request. `[--comment TEXT --yes --type CHOICE]`
- `hf discussions rename REPO_ID NUM NEW_TITLE` — Rename a discussion or pull request. `[--type CHOICE]`
- `hf discussions reopen REPO_ID NUM` — Reopen a closed discussion or pull request. `[--comment TEXT --yes --type CHOICE]`

### `hf endpoints` — Manage Hugging Face Inference Endpoints.

- `hf endpoints catalog deploy --repo TEXT` — Deploy an Inference Endpoint from the Model Catalog. `[--name TEXT --accelerator TEXT --namespace TEXT]`
- `hf endpoints catalog list` — List available Catalog models.
- `hf endpoints delete NAME` — Delete an Inference Endpoint permanently. `[--namespace TEXT --yes]`
- `hf endpoints deploy NAME --repo TEXT --framework TEXT --accelerator TEXT --instance-size TEXT --instance-type TEXT --region TEXT --vendor TEXT` — Deploy an Inference Endpoint from a Hub repository. `[--namespace TEXT --task TEXT --min-replica INTEGER --max-replica INTEGER --scale-to-zero-timeout INTEGER --scaling-metric CHOICE --scaling-threshold FLOAT]`
- `hf endpoints describe NAME` — Get information about an existing endpoint. `[--namespace TEXT]`
- `hf endpoints list` — List all Inference Endpoints for the given namespace. `[--namespace TEXT --format CHOICE --quiet]`
- `hf endpoints pause NAME` — Pause an Inference Endpoint. `[--namespace TEXT]`
- `hf endpoints resume NAME` — Resume an Inference Endpoint. `[--namespace TEXT --fail-if-already-running]`
- `hf endpoints scale-to-zero NAME` — Scale an Inference Endpoint to zero. `[--namespace TEXT]`
- `hf endpoints update NAME` — Update an existing endpoint. `[--namespace TEXT --repo TEXT --accelerator TEXT --instance-size TEXT --instance-type TEXT --framework TEXT --revision TEXT --task TEXT --min-replica INTEGER --max-replica INTEGER --scale-to-zero-timeout INTEGER --scaling-metric CHOICE --scaling-threshold FLOAT]`

### `hf extensions` — Manage hf CLI extensions.

- `hf extensions exec NAME` — Execute an installed extension.
- `hf extensions install REPO_ID` — Install an extension from a public GitHub repository. `[--force]`
- `hf extensions list` — List installed extension commands. `[--format CHOICE --quiet]`
- `hf extensions remove NAME` — Remove an installed extension.
- `hf extensions search` — Search extensions available on GitHub (tagged with 'hf-extension' topic). `[--format CHOICE --quiet]`

### `hf jobs` — Run and manage Jobs on the Hub.

- `hf jobs cancel JOB_ID` — Cancel a Job. `[--namespace TEXT]`
- `hf jobs hardware` — List available hardware options for Jobs.
- `hf jobs inspect JOB_IDS` — Display detailed information on one or more Jobs. `[--namespace TEXT]`
- `hf jobs logs JOB_ID` — Fetch the logs of a Job. `[--follow --tail INTEGER --namespace TEXT]`
- `hf jobs ps` — List Jobs. `[--all --namespace TEXT --filter TEXT --format TEXT --quiet]`
- `hf jobs run IMAGE COMMAND` — Run a Job. `[--env TEXT --secrets TEXT --label TEXT --volume TEXT --env-file TEXT --secrets-file TEXT --flavor CHOICE --timeout TEXT --detach --namespace TEXT]`
- `hf jobs scheduled delete SCHEDULED_JOB_ID` — Delete a scheduled Job. `[--namespace TEXT]`
- `hf jobs scheduled inspect SCHEDULED_JOB_IDS` — Display detailed information on one or more scheduled Jobs. `[--namespace TEXT]`
- `hf jobs scheduled ps` — List scheduled Jobs. `[--all --namespace TEXT --filter TEXT --format TEXT --quiet]`
- `hf jobs scheduled resume SCHEDULED_JOB_ID` — Resume (unpause) a scheduled Job. `[--namespace TEXT]`
- `hf jobs scheduled run SCHEDULE IMAGE COMMAND` — Schedule a Job. `[--suspend --concurrency --env TEXT --secrets TEXT --label TEXT --volume TEXT --env-file TEXT --secrets-file TEXT --flavor CHOICE --timeout TEXT --namespace TEXT]`
- `hf jobs scheduled suspend SCHEDULED_JOB_ID` — Suspend (pause) a scheduled Job. `[--namespace TEXT]`
- `hf jobs scheduled uv run SCHEDULE SCRIPT` — Run a UV script (local file or URL) on HF infrastructure. `[--suspend --concurrency --image TEXT --flavor CHOICE --env TEXT --secrets TEXT --label TEXT --volume TEXT --env-file TEXT --secrets-file TEXT --timeout TEXT --namespace TEXT --with TEXT --python TEXT]`
- `hf jobs stats` — Fetch the resource usage statistics and metrics of Jobs. `[--namespace TEXT]`
- `hf jobs uv run SCRIPT` — Run a UV script (local file or URL) on HF infrastructure. `[--image TEXT --flavor CHOICE --env TEXT --secrets TEXT --label TEXT --volume TEXT --env-file TEXT --secrets-file TEXT --timeout TEXT --detach --namespace TEXT --with TEXT --python TEXT]`

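A sketch of the Jobs workflow above: submit a local uv script as a detached Job, then follow its logs. The script name, `--flavor` value, and `JOB_ID` are hypothetical placeholders, and the commands are echoed so the sketch runs offline.

```shell
#!/usr/bin/env bash
# Sketch: run a uv script as a detached Job, then tail its logs.
# Script name, flavor value, and JOB_ID are placeholders; commands are
# printed rather than executed.
set -euo pipefail

run_cmd=(hf jobs uv run train.py --flavor cpu-basic --detach)
logs_cmd=(hf jobs logs JOB_ID --follow)

echo "${run_cmd[*]}"
echo "${logs_cmd[*]}"
```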
### `hf models` — Interact with models on the Hub.

- `hf models info MODEL_ID` — Get info about a model on the Hub. Output is in JSON format. `[--revision TEXT --expand TEXT]`
- `hf models list` — List models on the Hub. `[--search TEXT --author TEXT --filter TEXT --num-parameters TEXT --sort CHOICE --limit INTEGER --expand TEXT --format CHOICE --quiet]`

### `hf papers` — Interact with papers on the Hub.

- `hf papers info PAPER_ID` — Get info about a paper on the Hub. Output is in JSON format.
- `hf papers list` — List daily papers on the Hub. `[--date TEXT --week TEXT --month TEXT --submitter TEXT --sort CHOICE --limit INTEGER --format CHOICE --quiet]`
- `hf papers read PAPER_ID` — Read a paper as markdown.
- `hf papers search QUERY` — Search papers on the Hub. `[--limit INTEGER --format CHOICE --quiet]`

### `hf repos` — Manage repos on the Hub.
|
| 131 |
+
|
| 132 |
+
- `hf repos branch create REPO_ID BRANCH` — Create a new branch for a repo on the Hub. `[--revision TEXT --type CHOICE --exist-ok]`
|
| 133 |
+
- `hf repos branch delete REPO_ID BRANCH` — Delete a branch from a repo on the Hub. `[--type CHOICE]`
|
| 134 |
+
- `hf repos create REPO_ID` — Create a new repo on the Hub. `[--type CHOICE --space-sdk TEXT --private --public --protected --exist-ok --resource-group-id TEXT --flavor TEXT --storage TEXT --sleep-time INTEGER --secrets TEXT --secrets-file TEXT --env TEXT --env-file TEXT]`
|
| 135 |
+
- `hf repos delete REPO_ID` — Delete a repo from the Hub. This is an irreversible operation. `[--type CHOICE --missing-ok]`
|
| 136 |
+
- `hf repos delete-files REPO_ID PATTERNS` — Delete files from a repo on the Hub. `[--type CHOICE --revision TEXT --commit-message TEXT --commit-description TEXT --create-pr]`
|
| 137 |
+
- `hf repos duplicate FROM_ID` — Duplicate a repo on the Hub (model, dataset, or Space). `[--type CHOICE --private --public --protected --exist-ok --flavor TEXT --storage TEXT --sleep-time INTEGER --secrets TEXT --secrets-file TEXT --env TEXT --env-file TEXT]`
|
| 138 |
+
- `hf repos move FROM_ID TO_ID` — Move a repository from a namespace to another namespace. `[--type CHOICE]`
|
| 139 |
+
- `hf repos settings REPO_ID` — Update the settings of a repository. `[--gated CHOICE --private --public --protected --type CHOICE]`
|
| 140 |
+
- `hf repos tag create REPO_ID TAG` — Create a tag for a repo. `[--message TEXT --revision TEXT --type CHOICE]`
|
| 141 |
+
- `hf repos tag delete REPO_ID TAG` — Delete a tag for a repo. `[--yes --type CHOICE]`
|
| 142 |
+
- `hf repos tag list REPO_ID` — List tags for a repo. `[--type CHOICE]`
|
| 143 |
+
|
| 144 |
+
### `hf skills` — Manage skills for AI assistants.
|
| 145 |
+
|
| 146 |
+
- `hf skills add` — Download a skill and install it for an AI assistant. `[--claude --codex --cursor --opencode --global --dest PATH --force]`
|
| 147 |
+
- `hf skills preview` — Print the generated SKILL.md to stdout.
|
| 148 |
+
|
| 149 |
+
### `hf spaces` — Interact with spaces on the Hub.
|
| 150 |
+
|
| 151 |
+
- `hf spaces dev-mode SPACE_ID` — Enable or disable dev mode on a Space. `[--stop]`
|
| 152 |
+
- `hf spaces hot-reload SPACE_ID` — Hot-reload any Python file of a Space without a full rebuild + restart. `[--local-file TEXT --skip-checks --skip-summary]`
|
| 153 |
+
- `hf spaces info SPACE_ID` — Get info about a space on the Hub. Output is in JSON format. `[--revision TEXT --expand TEXT]`
|
| 154 |
+
- `hf spaces list` — List spaces on the Hub. `[--search TEXT --author TEXT --filter TEXT --sort CHOICE --limit INTEGER --expand TEXT --format CHOICE --quiet]`
|
| 155 |
+
|
| 156 |
+
### `hf webhooks` — Manage webhooks on the Hub.
|
| 157 |
+
|
| 158 |
+
- `hf webhooks create --watch TEXT` — Create a new webhook. `[--url TEXT --job-id TEXT --domain CHOICE --secret TEXT]`
|
| 159 |
+
- `hf webhooks delete WEBHOOK_ID` — Delete a webhook permanently. `[--yes]`
|
| 160 |
+
- `hf webhooks disable WEBHOOK_ID` — Disable an active webhook.
|
| 161 |
+
- `hf webhooks enable WEBHOOK_ID` — Enable a disabled webhook.
|
| 162 |
+
- `hf webhooks info WEBHOOK_ID` — Show full details for a single webhook as JSON.
|
| 163 |
+
- `hf webhooks list` — List all webhooks for the current user. `[--format CHOICE --quiet]`
|
| 164 |
+
- `hf webhooks update WEBHOOK_ID` — Update an existing webhook. Only provided options are changed. `[--url TEXT --watch TEXT --domain CHOICE --secret TEXT]`
|
| 165 |
+
|
| 166 |
+
## Common options
|
| 167 |
+
|
| 168 |
+
- `--format` — Output format: `--format json` (or `--json`) or `--format table` (default).
|
| 169 |
+
- `-q / --quiet` — Minimal output.
|
| 170 |
+
- `--revision` — Git revision id which can be a branch name, a tag, or a commit hash.
|
| 171 |
+
- `--token` — Use a User Access Token. Prefer setting `HF_TOKEN` env var instead of passing `--token`.
|
| 172 |
+
- `--type` — The type of repository (model, dataset, or space).
|
| 173 |
+
|
| 174 |
+
## Mounting repos as local filesystems
|
| 175 |
+
|
| 176 |
+
To mount Hub repositories or buckets as local filesystems — no download, no copy, no waiting — use `hf-mount`. Files are fetched on demand. GitHub: https://github.com/huggingface/hf-mount
|
| 177 |
+
|
| 178 |
+
Install: `curl -fsSL https://raw.githubusercontent.com/huggingface/hf-mount/main/install.sh | sh`
|
| 179 |
+
|
| 180 |
+
Some command examples:
|
| 181 |
+
- `hf-mount start repo openai-community/gpt2 /tmp/gpt2` — mount a repo (read-only)
|
| 182 |
+
- `hf-mount start --hf-token $HF_TOKEN bucket myuser/my-bucket /tmp/data` — mount a bucket (read-write)
|
| 183 |
+
- `hf-mount status` / `hf-mount stop /tmp/data` — list or unmount
|
| 184 |
+
|
| 185 |
+
## Tips
|
| 186 |
+
|
| 187 |
+
- Use `hf <command> --help` for full options, descriptions, usage, and real-world examples
|
| 188 |
+
- Authenticate with `HF_TOKEN` env var (recommended) or with `--token`
|
.agents/skills/reinforcement-learning/SKILL.md ADDED

---
name: reinforcement-learning
description: Use when implementing RL algorithms, training agents with rewards, or aligning LLMs with human feedback - covers policy gradients, PPO, Q-learning, RLHF, and GRPO.
---

# Reinforcement Learning

## Identity

## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
|
.agents/skills/reinforcement-learning/references/patterns.md ADDED

# Reinforcement Learning

## Patterns

### **Golden Rules**

---
##### **Rule**
Reward shaping is critical
##### **Reason**
Sparse rewards make learning nearly impossible

---
##### **Rule**
Start simple, scale up
##### **Reason**
Debug on toy environments before complex ones

---
##### **Rule**
Monitor training metrics obsessively
##### **Reason**
RL training is notoriously unstable

---
##### **Rule**
Use appropriate baselines
##### **Reason**
Reduces variance in policy gradients

---
##### **Rule**
Clip/constrain policy updates
##### **Reason**
Prevents catastrophic policy collapse

---
##### **Rule**
Separate exploration from exploitation
##### **Reason**
Ensures sufficient state-space coverage

### **Algorithm Taxonomy**

#### **Value Based**
##### **Algorithms**
- Q-Learning
- DQN
- Double DQN
- Dueling DQN
##### **Learns**
Q(s,a) - value of state-action pairs
##### **Best For**
- Discrete actions
- Atari games

#### **Policy Based**
##### **Algorithms**
- REINFORCE
- Policy Gradient
##### **Learns**
pi(a|s) - the policy directly
##### **Best For**
- Continuous actions
- Robotics

#### **Actor Critic**
##### **Algorithms**
- A2C/A3C
- PPO
- SAC
- TRPO
##### **Learns**
Both V and pi
##### **Best For**
- Most tasks
- LLM alignment

### **On vs Off Policy**

#### **On Policy**
##### **Algorithms**
- PPO
- A2C
##### **Property**
Learn from current policy samples
##### **Pros**
More stable
##### **Cons**
Fresh data required

#### **Off Policy**
##### **Algorithms**
- DQN
- SAC
##### **Property**
Learn from any policy's samples
##### **Pros**
More sample efficient
##### **Cons**
Requires replay buffer

### **Discount Factor**

#### **Short Horizon**

#### **Medium Horizon**

#### **Long Horizon**

#### **Infinite Horizon**
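The horizon sections above hinge on the discount factor gamma, which down-weights future rewards. A minimal, dependency-free sketch of the discounted return it defines:

```python
def discounted_return(rewards, gamma):
    """Compute G = r_0 + gamma*r_1 + gamma^2*r_2 + ... for one episode."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G_t = r_t + gamma * G_{t+1}
        g = r + gamma * g
    return g

# A small gamma makes the agent short-sighted; a gamma near 1 values the long run.
short = discounted_return([0.0, 0.0, 1.0], gamma=0.5)    # 0.25
far = discounted_return([0.0, 0.0, 1.0], gamma=0.99)     # 0.9801
```

The same reward arriving two steps away is worth four times more to the gamma=0.99 agent than to the gamma=0.5 one, which is why the horizon buckets above matter.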
### **PPO Config**

#### **Clip Epsilon**
0.1-0.3 (typically 0.2)
#### **Entropy Coef**
0.01 (encourages exploration)
#### **Value Coef**
0.5
#### **Max Grad Norm**
0.5
#### **N Epochs**
3-10 per batch

### **RLHF Pipeline**

#### **Step 1: SFT**
##### **Description**
Supervised Fine-Tuning
##### **Purpose**
Establish baseline helpful behavior

#### **Step 2: Reward Model**
##### **Description**
Train on human preference comparisons
##### **Output**
Reward(prompt, response) = scalar
##### **Loss**
Bradley-Terry: -log(sigmoid(r_chosen - r_rejected))

#### **Step 3: PPO**
##### **Description**
Optimize policy with KL penalty
##### **Formula**
reward = r(x,y) - beta * KL(pi || pi_ref)
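The reward-model loss in Step 2 can be written out in plain Python; a minimal sketch for a single preference pair (scalar rewards, no batching):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): low when the chosen response scores higher."""
    margin = r_chosen - r_rejected
    sigmoid = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(sigmoid)

# Equal scores give -log(0.5) ~= 0.693; a clear positive margin drives the loss toward 0.
```

Minimizing this pushes the reward model to score the human-preferred response above the rejected one, which is exactly the scalar signal Step 3 then optimizes against.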
## Anti-Patterns

---
#### **Pattern**
Sparse rewards
#### **Problem**
Agent learns nothing
#### **Solution**
Reward shaping, dense rewards

---
#### **Pattern**
No baseline/advantage
#### **Problem**
High-variance gradients
#### **Solution**
Use GAE, value baseline

---
#### **Pattern**
Large policy updates
#### **Problem**
Training collapse
#### **Solution**
PPO clipping, KL penalty

---
#### **Pattern**
No replay buffer (off-policy)
#### **Problem**
Sample inefficiency
#### **Solution**
Experience replay

---
#### **Pattern**
Same network for Q and target
#### **Problem**
Unstable learning
#### **Solution**
Separate target network

---
#### **Pattern**
Ignoring KL in RLHF
#### **Problem**
Model drift, reward hacking
#### **Solution**
KL penalty to reference model
.agents/skills/reinforcement-learning/references/sharp_edges.md ADDED

# Reinforcement Learning - Sharp Edges

## Reward Hacking in RLHF

### **Id**
reward-hacking
### **Severity**
critical
### **Summary**
Model finds exploits in the reward model instead of being helpful
### **Symptoms**
- Reward score increases but quality decreases
- Model produces verbose but unhelpful responses
- Responses game the reward model's biases
- Human evaluators disagree with high reward scores
### **Why**
The reward model is an imperfect proxy for human preferences.
Given enough optimization pressure, the policy finds reward model exploits.
Common exploits: verbosity, sycophancy, specific phrases the reward model likes.

### **Gotcha**

```python
# Optimizing reward too aggressively
for step in range(1_000_000):
    reward = reward_model(response)
    loss = -reward  # Pure reward maximization
    loss.backward()
# Model learns to game the reward model
```

### **Solution**

```python
# 1. KL penalty to stay close to the reference
reward = reward_model(response) - kl_coef * kl_divergence(policy, reference)

# 2. Periodically refresh the reward model on new data
# 3. Ensemble multiple reward models
# 4. Human evaluation checkpoints

# 5. Early stopping based on held-out evaluation
if eval_score < best_score - tolerance:
    break  # Stop before overfitting to the reward model
```

## Catastrophic Policy Collapse

### **Id**
policy-collapse
### **Severity**
critical
### **Summary**
Policy suddenly degenerates after seeming stable
### **Symptoms**
- Entropy drops to near zero
- Policy outputs become deterministic/repetitive
- Reward suddenly crashes
- All samples look identical
### **Why**
Without proper constraints, policy gradient updates can be too large.
A large bad update can push the policy into a degenerate state.
From there, all samples reinforce the bad behavior.

### **Gotcha**

```python
# REINFORCE without clipping
ratio = new_prob / old_prob
loss = -ratio * advantage  # No limit on the ratio!
# If ratio >> 1, this can destroy the policy
```

### **Solution**

```python
# PPO clipping prevents catastrophic updates
ratio = torch.exp(new_log_prob - old_log_prob)

surr1 = ratio * advantage
surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantage

loss = -torch.min(surr1, surr2).mean()

# Also: monitor entropy, add an entropy bonus
entropy_bonus = -entropy_coef * entropy.mean()
total_loss = loss + entropy_bonus
```
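The clipped surrogate above can be checked without torch; a scalar, stdlib-only sketch of the same objective for one sample (the tensor version in the snippet is what you would actually train with):

```python
import math

def ppo_clip_loss(new_logp, old_logp, advantage, eps=0.2):
    """Negative clipped PPO surrogate for a single (log-prob, advantage) sample."""
    ratio = math.exp(new_logp - old_logp)            # importance ratio pi_new / pi_old
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clamp ratio into [1-eps, 1+eps]
    return -min(ratio * advantage, clipped * advantage)

# With no policy change (ratio = 1) the loss is just -advantage; a ratio far above
# 1 + eps earns no extra credit for a positive advantage, which caps the update size.
```

Note that for a negative advantage the `min` keeps the *unclipped* (worse) term, so the objective still penalizes moves that make bad actions more likely.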
## Agent Never Learns Due to Sparse Rewards

### **Id**
sparse-reward-failure
### **Severity**
high
### **Summary**
Reward signal too rare for learning to occur
### **Symptoms**
- Agent takes random actions indefinitely
- No improvement over a random baseline
- Policy gradient has near-zero signal
### **Why**
If reward only comes at the episode's end (or rarely), the agent gets no
feedback about which intermediate actions were good.
Credit assignment becomes impossible.

### **Gotcha**

```python
# Sparse reward environment
def step(action):
    # Only reward at the very end
    if is_goal_reached():
        return observation, 1.0, True, {}   # Reward only here
    return observation, 0.0, False, {}      # No intermediate signal
```

### **Solution**

```python
# 1. Reward shaping - add intermediate rewards
def shaped_reward(state, action, next_state):
    sparse = 1.0 if is_goal_reached(next_state) else 0.0

    # Potential-based shaping (preserves the optimal policy)
    potential_diff = gamma * potential(next_state) - potential(state)

    return sparse + shaping_coef * potential_diff

# 2. Curiosity-driven exploration
# 3. Hierarchical RL with subgoals
# 4. Curriculum learning - start with easier tasks
```
## Q-Value Overestimation in DQN

### **Id**
value-function-overestimation
### **Severity**
high
### **Summary**
Q-learning systematically overestimates values
### **Symptoms**
- Q-values grow unrealistically large
- Agent is overconfident about bad actions
- Performance is worse than the Q-values suggest
### **Why**
max_a Q(s,a) takes the maximum over noisy estimates.
This systematically picks the action with the highest positive noise.
Over many updates, this bias compounds.

### **Gotcha**

```python
# Standard DQN - has overestimation bias
target_q = reward + gamma * target_net(next_state).max()
# max() selects the noisiest high estimate
```

### **Solution**

```python
# Double DQN - use the online net to select, the target net to evaluate
next_actions = online_net(next_state).argmax(dim=1)
target_q = reward + gamma * target_net(next_state).gather(1, next_actions.unsqueeze(1)).squeeze(1)

# The action selection and value estimation use different networks
# This breaks the overestimation cycle
```
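The bias itself is easy to demonstrate numerically: even when every true Q-value is zero, taking the max over noisy estimates is positive on average. A small stdlib-only simulation, with the Q-network replaced by additive Gaussian noise purely for illustration:

```python
import random

random.seed(0)
true_q = [0.0, 0.0, 0.0]  # the true value of every action is exactly 0

n_trials = 2000
avg_max = sum(
    max(q + random.gauss(0.0, 1.0) for q in true_q)  # noisy Q estimates
    for _ in range(n_trials)
) / n_trials

# avg_max comes out clearly positive (around the expected max of three
# standard normals, ~0.85) even though the true best value is 0.
```

This is the gap Double DQN closes: selecting with one network and evaluating with another decorrelates the noise, so the max no longer rides the largest positive error.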
## KL Divergence Explodes During RLHF

### **Id**
kl-divergence-explosion
### **Severity**
high
### **Summary**
Policy drifts too far from the reference model
### **Symptoms**
- KL penalty term dominates the loss
- Model forgets base capabilities
- Responses become incoherent
- Generation quality degrades
### **Why**
Without a proper KL constraint, the policy can drift arbitrarily far.
The reference model represents the base capabilities we want to preserve.
Drifting too far means catastrophic forgetting.

### **Gotcha**

```python
# KL coefficient too low
kl_coef = 0.001  # Too weak!
reward = reward_score - kl_coef * kl  # Barely constrains the policy
```

### **Solution**

```python
# 1. An appropriate KL coefficient (0.1 - 0.5 is typical)
kl_coef = 0.1

# 2. Adaptive KL penalty
if kl > target_kl * 1.5:
    kl_coef *= 1.5
elif kl < target_kl / 1.5:
    kl_coef /= 1.5

# 3. Hard KL constraint (TRPO-style)
if kl > max_kl:
    reject_update()
```
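The adaptive penalty sketched in the solution can be wrapped as a tiny controller; a stdlib-only sketch of that exact update rule (the 1.5 factor and names are illustrative, not a fixed API):

```python
def adapt_kl_coef(kl_coef, observed_kl, target_kl):
    """Raise the penalty when KL overshoots the target; relax it when KL undershoots."""
    if observed_kl > target_kl * 1.5:
        kl_coef *= 1.5   # policy drifting too far: constrain harder
    elif observed_kl < target_kl / 1.5:
        kl_coef /= 1.5   # policy very close to reference: let reward act
    return kl_coef
```

Called once per training iteration with the measured KL, this keeps the policy hovering around `target_kl` instead of letting either the reward term or the penalty term dominate.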
.agents/skills/reinforcement-learning/references/validations.md ADDED

# Reinforcement Learning - Validations

## PPO Without Clipping

### **Id**
ppo-no-clipping
### **Severity**
error
### **Type**
regex
### **Pattern**
- ratio.*advantage(?!.*clamp|clip)
- policy_loss.*=.*-.*ratio.*advantage(?!.*min)
### **Message**
PPO requires clipping to prevent catastrophic policy updates.
### **Fix Action**
Add: torch.clamp(ratio, 1-eps, 1+eps) and use the min of the clipped/unclipped terms
### **Applies To**
- **/*.py

## Advantages Not Normalized

### **Id**
no-advantage-normalization
### **Severity**
warning
### **Type**
regex
### **Pattern**
- advantage.*=(?!.*(mean|std|normalize))
### **Message**
Normalizing advantages reduces variance and improves training stability.
### **Fix Action**
Add: advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
### **Applies To**
- **/*ppo*.py
- **/*rl*.py

## Missing Entropy Bonus

### **Id**
no-entropy-bonus
### **Severity**
warning
### **Type**
regex
### **Pattern**
- policy_loss(?!.*entropy)
- actor_loss(?!.*entropy)
### **Message**
An entropy bonus encourages exploration and prevents premature convergence.
### **Fix Action**
Add: total_loss = policy_loss - entropy_coef * entropy.mean()
### **Applies To**
- **/*ppo*.py
- **/*a2c*.py

## RLHF Without KL Penalty

### **Id**
rlhf-no-kl-penalty
### **Severity**
error
### **Type**
regex
### **Pattern**
- reward_model.*response(?!.*kl|.*reference)
### **Message**
RLHF requires a KL penalty to prevent model drift and reward hacking.
### **Fix Action**
Add: reward = reward_score - kl_coef * kl_divergence(policy, reference)
### **Applies To**
- **/*rlhf*.py
- **/*alignment*.py

## DQN Without Target Network

### **Id**
dqn-no-target-network
### **Severity**
error
### **Type**
regex
### **Pattern**
- q_network.*max(?!.*target)
- q_net.*next_state(?!.*target)
### **Message**
DQN requires a separate target network for stable learning.
### **Fix Action**
Add a target network and periodically update it: target_net.load_state_dict(q_net.state_dict())
### **Applies To**
- **/*dqn*.py
- **/*q_learning*.py

## RL Training Without Gradient Clipping

### **Id**
no-gradient-clipping-rl
### **Severity**
warning
### **Type**
regex
### **Pattern**
- loss\.backward\(\)\s*\n\s*optimizer\.step(?!.*clip_grad)
### **Message**
RL training benefits from gradient clipping for stability.
### **Fix Action**
Add: nn.utils.clip_grad_norm_(parameters, max_grad_norm)
### **Applies To**
- **/*train*.py
- **/*rl*.py

## Training Without Reward Logging

### **Id**
no-reward-logging
### **Severity**
info
### **Type**
regex
### **Pattern**
- for.*episode(?!.*log|.*print|.*wandb|.*writer)
### **Message**
RL training requires careful monitoring of reward and metrics.
### **Fix Action**
Log: episode_reward, policy_loss, value_loss, entropy, KL divergence
### **Applies To**
- **/*train*.py
- **/*rl*.py
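These rules are plain regexes with negative lookaheads: the rule fires only when the first part matches and the lookahead content is absent from the rest of the line. A quick stdlib check of the advantage-normalization pattern against a failing and a passing line (both sample code lines are hypothetical):

```python
import re

# The pattern from the "Advantages Not Normalized" rule above
pattern = re.compile(r"advantage.*=(?!.*(mean|std|normalize))")

flagged = "advantages = returns - values"  # assigns advantages with no normalization
clean = "advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)"

assert pattern.search(flagged) is not None  # rule fires
assert pattern.search(clean) is None        # rule stays quiet
```

Because these are single-line heuristics, they can miss normalization performed on a later line; treat a hit as a prompt to check, not proof of a bug.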
.gitignore ADDED

.env
skills-lock.json ADDED

{
  "version": 1,
  "skills": {
    "frontend-design": {
      "source": "anthropics/skills",
      "sourceType": "github",
      "computedHash": "516bd2154eb843a8240e43d5b285229129853114ad7075a5e141e1c08e408c84"
    },
    "hf-cli": {
      "source": "huggingface/skills",
      "sourceType": "github",
      "computedHash": "a6b2e303e6a15ef21f3e041e622733a632c123f2a7ca2074e2a1f0d7a911dc36"
    },
    "reinforcement-learning": {
      "source": "omer-metin/skills-for-antigravity",
      "sourceType": "github",
      "computedHash": "b2c8580ea8ae26f33b5cbb9a581778a7c9037b4e65d903f0458395ed006dc5da"
    }
  }
}