Optimize your Agent Skills the way ML teams optimize models — with a measured loss function you actually trust.
Inspired by Karpathy's autoresearch and the darwin-skill project, rebuilt around one conviction:
If you cannot trust your measurement, you cannot trust your optimization.
Every keep/revert decision in evolve-skill is gated by whether the measurement is reliable enough to justify it.
Auto-optimizing SKILL.md files is a great idea, but most implementations share three failure modes that quietly make skills worse over time:
| Failure mode | What happens | How evolve-skill blocks it |
|---|---|---|
| Scoring noise | Different scorers give wildly different scores; "improvements" are just lucky draws | Gate 1 — three-run variance check, SD ≤ 2.0 required |
| Overfitting | The optimizer learns to game the test prompts | Train/holdout split; Phase 4 holdout validation |
| Function drift | The score goes up but capabilities silently disappear | Gate 3 — semantic diff, ≥80% overlap required |
evolve-skill treats all three as first-class concerns, not afterthoughts.
Gate 1 — Measurement Stability. Score the baseline three times with three independent sub-agents. If the standard deviation exceeds 2 points, the rubric is too loose; tighten anchors before optimizing.
Gate 2 — Effect Size. A change is only kept if Δscore ≥ max(3, 2×SD). Anything smaller is within measurement noise.
Gate 3 — Function Preservation. Extract the skill's core functions before and after. If overlap drops below 80%, reject the change no matter how much the score "improved".
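Together the three gates form one keep/revert decision. A minimal sketch of that logic (function names are illustrative, not the actual APIs of the scripts in `scripts/`):

```python
import statistics

def gate1_stable(baseline_scores, max_sd=2.0):
    """Gate 1: three independent baseline scores must agree (SD <= 2.0)."""
    return statistics.stdev(baseline_scores) <= max_sd

def gate2_keep(old_score, new_score, sd, min_delta=3.0):
    """Gate 2: keep a change only if the gain clears measurement noise."""
    return (new_score - old_score) >= max(min_delta, 2 * sd)

def gate3_preserved(funcs_before, funcs_after, min_overlap=0.8):
    """Gate 3: core functions extracted before/after must overlap >= 80%."""
    if not funcs_before:
        return True
    kept = len(set(funcs_before) & set(funcs_after))
    return kept / len(set(funcs_before)) >= min_overlap
```

For example, with a baseline SD of 2.0, Gate 2 demands Δ ≥ max(3, 4) = 4 points, so a 2-point "improvement" is reverted as noise.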
Every dimension has three anchor points (1, 5, 10) written out in full in rubric/anchored-rubric.md. Scorers must cite which anchor their score is closest to. That's what keeps inter-rater agreement tight.
| Group | Weight | Dimensions |
|---|---|---|
| Structure | 50 | Frontmatter · Workflow · Edge cases · Checkpoints · Specificity · Resources |
| Effectiveness | 40 | Architectural fit · Live test performance (30) |
| Localization | 10 | Cross-language + cross-OS fidelity |
Live test performance is the heaviest single weight by design: a beautifully structured skill that produces bad output is still a bad skill.
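The group weights combine 1–10 dimension scores into a 0–100 total. A sketch of that aggregation; only the group totals (50/40/10) and the live-test weight (30) come from the table above, so the per-dimension split within Structure is a hypothetical split, not the actual values in rubric/rubric-v1.json:

```python
# Hypothetical per-dimension weights; group subtotals match the table.
WEIGHTS = {
    "frontmatter": 8, "workflow": 9, "edge_cases": 8,
    "checkpoints": 8, "specificity": 9, "resources": 8,   # Structure: 50
    "architectural_fit": 10, "live_test": 30,             # Effectiveness: 40
    "localization": 10,                                   # Localization: 10
}

def weighted_total(scores: dict) -> float:
    """Map 1-10 dimension scores onto a 0-100 overall score."""
    return sum(scores[dim] / 10 * w for dim, w in WEIGHTS.items())
```

A skill scoring 10 on every anchor lands at exactly 100; a mediocre live test drags the total down harder than any single structural flaw, matching the weighting rationale above.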
```
Phase 0 — Scope + branch + load playbook
Phase 1 — User writes training + holdout prompts
Phase 2 — Baseline × 3 runs; Gate 1 (stability check)
Phase 3 — Hill-climb loop with bandit budget allocation
          ├─ consult playbook
          ├─ one change per round
          ├─ independent re-score
          ├─ Gate 2 (effect size)
          ├─ Gate 3 (function preservation)
          └─ experiment on its own branch (always preserved)
Phase 4 — Unseal holdouts; check for overfitting
Phase 5 — Update playbook; final report
```
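Phase 3's bandit budget allocation spends rounds where they pay off. One way to do it, shown as a UCB1-style sketch (this is an illustration of the idea, not the actual contents of scripts/allocate.py):

```python
import math
import random

def pick_next_arm(arms):
    """Pick the next change category to try. `arms` maps a category
    name to (rounds_spent, total_delta). UCB1 balances categories with
    a high average score gain against under-explored ones."""
    untried = [name for name, (pulls, _) in arms.items() if pulls == 0]
    if untried:
        return random.choice(untried)  # try every category once first
    total_pulls = sum(pulls for pulls, _ in arms.values())

    def ucb(stats):
        pulls, total_delta = stats
        mean = total_delta / pulls
        bonus = math.sqrt(2 * math.log(total_pulls) / pulls)
        return mean + bonus

    return max(arms, key=lambda name: ucb(arms[name]))
```

With equal exploration bonuses, the category whose past edits earned the larger average Δscore gets the next round of budget.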
See examples/example-walkthrough.md for a complete cycle with real numbers.
```bash
# Place the folder into your Claude skills directory
cp -r evolve-skill ~/.claude/skills/

# Or link it
ln -s $(pwd)/evolve-skill ~/.claude/skills/evolve-skill
```

Then invoke:

```
"optimize all my skills"
"optimize my blog-writer skill"
"rate all my skills"        # evaluation only, no edits
"is my rubric reliable?"    # Gate 1 check only
```
```
evolve-skill/
├── SKILL.md                     # the skill itself
├── rubric/
│   ├── anchored-rubric.md       # 9 dimensions × 3 anchors each
│   └── rubric-v1.json           # machine-readable
├── scripts/
│   ├── calibrate.py             # Gate 1 — measurement stability
│   ├── score.py                 # apply rubric with anchors
│   ├── semantic_diff.py         # Gate 3 — function preservation
│   ├── ratchet.py               # Gate 2 — effect-size decision
│   └── allocate.py              # bandit budget allocation
├── templates/
│   ├── test-prompts.template.json
│   ├── playbook.template.md
│   └── result-card.html
├── examples/
│   ├── example-walkthrough.md
│   └── example-results.tsv
└── docs/
    ├── philosophy.md
    └── migrating-from-darwin.md
```
Every Python script runs standalone with clear I/O; the skill drives them through sub-agents, and you can also run them directly.
Over a full optimization run, healthy signals are:
- < 20% of candidate changes rejected by Gate 2 (effect size too small) — anchors give useful leverage
- < 5% rejected by Gate 3 (function drift) — the editor is disciplined
- ≥ 90% direction agreement between training delta and holdout delta — no significant overfitting
- Playbook patterns reused later produce positive Δ — crosstalk works
These meta-metrics are tracked in results.tsv over time. They tell you whether the optimizer itself is healthy, not just whether individual optimizations worked.
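Computing those health signals from the log is straightforward. A sketch, assuming a results.tsv schema with `rejected_by`, `train_delta`, and `holdout_delta` columns (the column names are assumptions, not documented fields):

```python
import csv

def optimizer_health(path="results.tsv"):
    """Summarize optimizer meta-metrics from a tab-separated results log."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    n = len(rows)
    gate2 = sum(r["rejected_by"] == "gate2" for r in rows) / n
    gate3 = sum(r["rejected_by"] == "gate3" for r in rows) / n
    kept = [r for r in rows if r["rejected_by"] == ""]
    # Direction agreement: did the holdout delta point the same way
    # as the training delta for every kept change?
    agree = sum(
        (float(r["train_delta"]) > 0) == (float(r["holdout_delta"]) > 0)
        for r in kept
    ) / max(1, len(kept))
    return {"gate2_reject_rate": gate2,
            "gate3_reject_rate": gate3,
            "direction_agreement": agree}
```

Reading the output against the thresholds above: gate rejection rates under 0.20 and 0.05, and direction agreement at or above 0.90, indicate a healthy optimizer.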
Darwin is the direct ancestor and gets credit for mapping autoresearch onto SKILL.md optimization. evolve-skill keeps the backbone (ratchet, dual evaluation, human-in-loop) and adds measurement discipline:
| | darwin-skill | evolve-skill |
|---|---|---|
| Rubric | "Score 1–10 per dimension" | Anchored 1/5/10 per dimension |
| Measurement noise | Unmeasured | Gate 1: three-run SD ≤ 2.0 |
| Keep threshold | new > old | Δ ≥ max(3, 2×SD) |
| Overfitting protection | None | Train/holdout split |
| Function drift | Rule, not measurement | Gate 3: semantic diff ≥ 80% |
| Budget allocation | Fixed 3 rounds | Multi-armed bandit |
| Failed experiments | Git-reverted into main | Preserved on feature branches |
| Learning across skills | None | Playbook crosstalk |
| Dry-run scores | Used for decisions | confidence=low blocks keep decisions |
See docs/migrating-from-darwin.md for a migration guide.
- Andrej Karpathy — autoresearch: autonomous experiment loops with a measurable objective
- 花叔 / alchaincyf — darwin-skill: the original SKILL.md adaptation
- Inter-rater agreement research: anchored rubrics produce tighter scores than open scales
- Train/val/test discipline from ML: holdouts are not optional
The contribution here is not novelty. It's measurement discipline: the loss function itself must be validated, not assumed.
MIT. Fork, modify, ship. Attribution appreciated.