Optimize your Agent Skills the way ML teams optimize models — with a measured loss function you actually trust.
Inspired by Karpathy's autoresearch and the darwin-skill project, rebuilt around one conviction:
If you cannot trust your measurement, you cannot trust your optimization.
Every keep/revert decision in evolve-skill is gated by whether the measurement is reliable enough to justify it.
Auto-optimizing SKILL.md files is a great idea, but most implementations share three failure modes that quietly make skills worse over time:
| Failure mode | What happens | How evolve-skill blocks it |
|---|---|---|
| Scoring noise | Different scorers give wildly different scores; "improvements" are just lucky draws | Gate 1 — three-run variance check, SD ≤ 2.0 required |
| Overfitting | The optimizer learns to game the test prompts | Train/holdout split; Phase 4 holdout validation |
| Function drift | The score goes up but capabilities silently disappear | Gate 3 — semantic diff, ≥80% overlap required |
evolve-skill treats all three as first-class concerns, not afterthoughts.
Gate 1 — Measurement Stability. Score the baseline three times with three independent sub-agents. If the standard deviation exceeds 2 points, the rubric is too loose; tighten anchors before optimizing.
Gate 2 — Effect Size. A change is only kept if Δscore ≥ max(3, 2×SD). Anything smaller is within measurement noise.
Gate 3 — Function Preservation. Extract the skill's core functions before and after. If overlap drops below 80%, reject the change no matter how much the score "improved".
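Together the three gates form one keep/revert decision. A minimal sketch of that logic (function names are illustrative, not the actual APIs of the scripts in `scripts/`):

```python
import statistics

def gate1_stable(baseline_scores, max_sd=2.0):
    """Gate 1: three independent baseline scores must agree (SD <= 2.0)."""
    return statistics.stdev(baseline_scores) <= max_sd

def gate2_keep(old_score, new_score, sd, min_delta=3.0):
    """Gate 2: keep a change only if the gain clears measurement noise."""
    return (new_score - old_score) >= max(min_delta, 2 * sd)

def gate3_preserved(funcs_before, funcs_after, min_overlap=0.8):
    """Gate 3: core functions extracted before/after must overlap >= 80%."""
    if not funcs_before:
        return True
    kept = len(set(funcs_before) & set(funcs_after))
    return kept / len(set(funcs_before)) >= min_overlap
```

For example, with a baseline SD of 2.0, Gate 2 demands Δ ≥ max(3, 4) = 4 points, so a 2-point "improvement" is reverted as noise.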
Every dimension has three anchor points (1, 5, 10) written out in full in rubric/anchored-rubric.md. Scorers must cite which anchor their score is closest to. That's what keeps inter-rater agreement tight.
| Group | Weight | Dimensions |
|---|---|---|
| Structure | 50 | Frontmatter · Workflow · Edge cases · Checkpoints · Specificity · Resources |
| Effectiveness | 40 | Architectural fit · Live test performance (30) |
| Localization | 10 | Cross-language + cross-OS fidelity |
Live test performance is the heaviest single weight by design: a beautifully structured skill that produces bad output is still a bad skill.
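The group weights combine 1–10 dimension scores into a 0–100 total. A sketch of that aggregation; only the group totals (50/40/10) and the live-test weight (30) come from the table above, so the per-dimension split within Structure is a hypothetical split, not the actual values in rubric/rubric-v1.json:

```python
# Hypothetical per-dimension weights; group subtotals match the table.
WEIGHTS = {
    "frontmatter": 8, "workflow": 9, "edge_cases": 8,
    "checkpoints": 8, "specificity": 9, "resources": 8,   # Structure: 50
    "architectural_fit": 10, "live_test": 30,             # Effectiveness: 40
    "localization": 10,                                   # Localization: 10
}

def weighted_total(scores: dict) -> float:
    """Map 1-10 dimension scores onto a 0-100 overall score."""
    return sum(scores[dim] / 10 * w for dim, w in WEIGHTS.items())
```

A skill scoring 10 on every anchor lands at exactly 100; a mediocre live test drags the total down harder than any single structural flaw, matching the weighting rationale above.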
```
Phase 0 — Scope + branch + load playbook
Phase 1 — User writes training + holdout prompts
Phase 2 — Baseline × 3 runs; Gate 1 (stability check)
Phase 3 — Hill-climb loop with bandit budget allocation
          ├─ consult playbook
          ├─ one change per round
          ├─ independent re-score
          ├─ Gate 2 (effect size)
          ├─ Gate 3 (function preservation)
          └─ experiment on its own branch (always preserved)
Phase 4 — Unseal holdouts; check for overfitting
Phase 5 — Update playbook; final report
```
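Phase 3's bandit budget allocation spends rounds where they pay off. One way to do it, shown as a UCB1-style sketch (this is an illustration of the idea, not the actual contents of scripts/allocate.py):

```python
import math
import random

def pick_next_arm(arms):
    """Pick the next change category to try. `arms` maps a category
    name to (rounds_spent, total_delta). UCB1 balances categories with
    a high average score gain against under-explored ones."""
    untried = [name for name, (pulls, _) in arms.items() if pulls == 0]
    if untried:
        return random.choice(untried)  # try every category once first
    total_pulls = sum(pulls for pulls, _ in arms.values())

    def ucb(stats):
        pulls, total_delta = stats
        mean = total_delta / pulls
        bonus = math.sqrt(2 * math.log(total_pulls) / pulls)
        return mean + bonus

    return max(arms, key=lambda name: ucb(arms[name]))
```

With equal exploration bonuses, the category whose past edits earned the larger average Δscore gets the next round of budget.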
See examples/example-walkthrough.md for a complete cycle with real numbers.
```bash
# Place the folder into your Claude skills directory
cp -r evolve-skill ~/.claude/skills/

# Or link it
ln -s $(pwd)/evolve-skill ~/.claude/skills/evolve-skill
```

Then invoke:

```
"optimize all my skills"
"optimize my blog-writer skill"
"rate all my skills"        # evaluation only, no edits
"is my rubric reliable?"    # Gate 1 check only
```
```
evolve-skill/
├── SKILL.md                     # the skill itself
├── rubric/
│   ├── anchored-rubric.md       # 9 dimensions × 3 anchors each
│   └── rubric-v1.json           # machine-readable
├── scripts/
│   ├── calibrate.py             # Gate 1 — measurement stability
│   ├── score.py                 # apply rubric with anchors
│   ├── semantic_diff.py         # Gate 3 — function preservation
│   ├── ratchet.py               # Gate 2 — effect-size decision
│   └── allocate.py              # bandit budget allocation
├── templates/
│   ├── test-prompts.template.json
│   ├── playbook.template.md
│   └── result-card.html
├── examples/
│   ├── example-walkthrough.md
│   └── example-results.tsv
└── docs/
    ├── philosophy.md
    └── migrating-from-darwin.md
```
Every Python script runs standalone with clear I/O; the skill drives them through sub-agents, and you can also run them directly.
Over a full optimization run, healthy signals are:
- < 20% of candidate changes rejected by Gate 2 (effect size too small) — anchors give useful leverage
- < 5% rejected by Gate 3 (function drift) — the editor is disciplined
- ≥ 90% direction agreement between training delta and holdout delta — no significant overfitting
- Playbook patterns reused later produce positive Δ — crosstalk works
These meta-metrics are tracked in results.tsv over time. They tell you whether the optimizer itself is healthy, not just whether individual optimizations worked.
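Computing those health signals from the log is straightforward. A sketch, assuming a results.tsv schema with `rejected_by`, `train_delta`, and `holdout_delta` columns (the column names are assumptions, not documented fields):

```python
import csv

def optimizer_health(path="results.tsv"):
    """Summarize optimizer meta-metrics from a tab-separated results log."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    n = len(rows)
    gate2 = sum(r["rejected_by"] == "gate2" for r in rows) / n
    gate3 = sum(r["rejected_by"] == "gate3" for r in rows) / n
    kept = [r for r in rows if r["rejected_by"] == ""]
    # Direction agreement: did the holdout delta point the same way
    # as the training delta for every kept change?
    agree = sum(
        (float(r["train_delta"]) > 0) == (float(r["holdout_delta"]) > 0)
        for r in kept
    ) / max(1, len(kept))
    return {"gate2_reject_rate": gate2,
            "gate3_reject_rate": gate3,
            "direction_agreement": agree}
```

Reading the output against the thresholds above: gate rejection rates under 0.20 and 0.05, and direction agreement at or above 0.90, indicate a healthy optimizer.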
Darwin is the direct ancestor and gets credit for mapping autoresearch onto SKILL.md optimization. evolve-skill keeps the backbone (ratchet, dual evaluation, human-in-loop) and adds measurement discipline:
| | darwin-skill | evolve-skill |
|---|---|---|
| Rubric | "Score 1–10 per dimension" | Anchored 1/5/10 per dimension |
| Measurement noise | Unmeasured | Gate 1: three-run SD ≤ 2.0 |
| Keep threshold | new > old | Δ ≥ max(3, 2×SD) |
| Overfitting protection | None | Train/holdout split |
| Function drift | Rule, not measurement | Gate 3: semantic diff ≥ 80% |
| Budget allocation | Fixed 3 rounds | Multi-armed bandit |
| Failed experiments | Git-reverted into main | Preserved on feature branches |
| Learning across skills | None | Playbook crosstalk |
| Dry-run scores | Used for decisions | confidence=low blocks keep decisions |
See docs/migrating-from-darwin.md for a migration guide.
- Andrej Karpathy — autoresearch: autonomous experiment loops with a measurable objective
- 花叔 / alchaincyf — darwin-skill: the original SKILL.md adaptation
- Inter-rater agreement research: anchored rubrics produce tighter scores than open scales
- Train/val/test discipline from ML: holdouts are not optional
The contribution here is not novelty. It's measurement discipline: the loss function itself must be validated, not assumed.
MIT. Fork, modify, ship. Attribution appreciated.