
evolve-skill

Optimize your Agent Skills the way ML teams optimize models — with a measured loss function you actually trust.

Inspired by Karpathy's autoresearch and the darwin-skill project, rebuilt around one conviction:

If you cannot trust your measurement, you cannot trust your optimization.

Every keep/revert decision in evolve-skill is gated by whether the measurement is reliable enough to justify it.


Why This Exists

Auto-optimizing SKILL.md files is a great idea. Most implementations have three failure modes that quietly make skills worse over time:

| Failure mode | What happens | How evolve-skill blocks it |
|---|---|---|
| Scoring noise | Different scorers give wildly different scores; "improvements" are just lucky draws | Gate 1: three-run variance check, SD ≤ 2.0 required |
| Overfitting | The optimizer learns to game the test prompts | Train/holdout split; Phase 4 holdout validation |
| Function drift | The score goes up but capabilities silently disappear | Gate 3: semantic diff, ≥80% overlap required |

evolve-skill treats all three as first-class concerns, not afterthoughts.


The Three Gates

Gate 1 — Measurement Stability. Score the baseline three times with three independent sub-agents. If the standard deviation exceeds 2 points, the rubric is too loose; tighten anchors before optimizing.
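In code, Gate 1 is just a standard-deviation check over the three runs. A minimal sketch (illustrative; calibrate.py's actual interface may differ):

import statistics

def gate1_stable(scores: list[float], max_sd: float = 2.0) -> bool:
    """Gate 1: the three baseline runs must agree to within max_sd points."""
    sd = statistics.stdev(scores)  # sample SD of the three sub-agent scores
    return sd <= max_sd

print(gate1_stable([71.0, 73.0, 72.0]))  # SD = 1.0 -> True: safe to optimize
print(gate1_stable([64.0, 75.0, 70.0]))  # SD ≈ 5.5 -> False: tighten anchors first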

Gate 2 — Effect Size. A change is only kept if Δscore ≥ max(3, 2×SD). Anything smaller is within measurement noise.
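The decision rule is small enough to show directly; a sketch of the comparison ratchet.py makes (function name and signature assumed):

def gate2_keep(old: float, new: float, sd: float) -> bool:
    """Gate 2: keep a change only if its effect size clears measurement noise."""
    threshold = max(3.0, 2.0 * sd)  # the bar rises with a noisier rubric
    return (new - old) >= threshold

print(gate2_keep(72.0, 74.5, 1.0))  # Δ = 2.5 < 3.0 -> False: revert
print(gate2_keep(72.0, 76.5, 2.0))  # Δ = 4.5 ≥ 4.0 -> True: keep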

Gate 3 — Function Preservation. Extract the skill's core functions before and after. If overlap drops below 80%, reject the change no matter how much the score "improved".
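Once the core functions are extracted (the hard, model-driven part of semantic_diff.py), the gate itself reduces to a set-overlap check. A sketch with hypothetical function lists:

def gate3_preserved(before: set[str], after: set[str], min_overlap: float = 0.8) -> bool:
    """Gate 3: at least 80% of the original core functions must survive the edit."""
    if not before:
        return True
    return len(before & after) / len(before) >= min_overlap

before = {"draft outline", "write post", "add citations", "seo pass", "tone check"}
after = {"draft outline", "write post", "add citations", "tone check"}
print(gate3_preserved(before, after))  # 4/5 = 0.80 -> True: just passes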


The 9-Dimension Anchored Rubric

Every dimension has three anchor points (1, 5, 10) written out in full in rubric/anchored-rubric.md. Scorers must cite which anchor their score is closest to. That's what keeps inter-rater agreement tight.

| Group | Weight | Dimensions |
|---|---|---|
| Structure | 50 | Frontmatter · Workflow · Edge cases · Checkpoints · Specificity · Resources |
| Effectiveness | 40 | Architectural fit · Live test performance (30) |
| Localization | 10 | Cross-language + cross-OS fidelity |

Live test performance is the heaviest single weight by design: a beautifully structured skill that produces bad output is still a bad skill.
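Combining the groups into a total is a plain weighted sum. A sketch, assuming each group score is the mean of its dimensions on the rubric's 1-10 anchor scale (rubric-v1.json is the authoritative weighting):

WEIGHTS = {"structure": 50, "effectiveness": 40, "localization": 10}

def total_score(group_scores: dict[str, float]) -> float:
    """Weighted total on a 0-100 scale from 1-10 group averages."""
    return sum(WEIGHTS[g] * s / 10 for g, s in group_scores.items())

print(total_score({"structure": 7.0, "effectiveness": 6.5, "localization": 9.0}))
# 50·0.70 + 40·0.65 + 10·0.90 = 70.0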


Workflow at a Glance

Phase 0  — Scope + branch + load playbook
Phase 1  — User writes training + holdout prompts
Phase 2  — Baseline × 3 runs; Gate 1 (stability check)
Phase 3  — Hill-climb loop with bandit budget allocation
              ├─ consult playbook
              ├─ one change per round
              ├─ independent re-score
              ├─ Gate 2 (effect size)
              ├─ Gate 3 (function preservation)
              └─ experiment on its own branch (always preserved)
Phase 4  — Unseal holdouts; check for overfitting
Phase 5  — Update playbook; final report
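Phase 3's budget allocation is a bandit problem: spend more rounds on the kinds of edits that have produced real Δ so far. One common choice is UCB1; a sketch of that approach (allocate.py's actual strategy and interface may differ, and the arm names here are hypothetical):

import math

def ucb1_pick(pulls: dict[str, int], mean_delta: dict[str, float]) -> str:
    """Pick the next edit category: exploit high mean Δ, explore rarely-tried arms."""
    total = sum(pulls.values())
    def ucb(arm: str) -> float:
        if pulls[arm] == 0:
            return float("inf")  # try every arm at least once
        # classic UCB1 assumes rewards in [0, 1]; real score deltas would need rescaling
        return mean_delta[arm] + math.sqrt(2 * math.log(total) / pulls[arm])
    return max(pulls, key=ucb)

pulls = {"tighten-workflow": 3, "add-edge-cases": 1, "rewrite-frontmatter": 2}
deltas = {"tighten-workflow": 4.0, "add-edge-cases": 0.5, "rewrite-frontmatter": 2.0}
print(ucb1_pick(pulls, deltas))  # -> "tighten-workflow": exploitation wins here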

See examples/example-walkthrough.md for a complete cycle with real numbers.


Install

# Place the folder into your Claude skills directory
cp -r evolve-skill ~/.claude/skills/

# Or link it
ln -s $(pwd)/evolve-skill ~/.claude/skills/evolve-skill

Then invoke:

"optimize all my skills"
"optimize my blog-writer skill"
"rate all my skills"                # evaluation only, no edits
"is my rubric reliable?"            # Gate 1 check only

What's Inside

evolve-skill/
├── SKILL.md                      # the skill itself
├── rubric/
│   ├── anchored-rubric.md        # 9 dimensions × 3 anchors each
│   └── rubric-v1.json            # machine-readable
├── scripts/
│   ├── calibrate.py              # Gate 1 — measurement stability
│   ├── score.py                  # apply rubric with anchors
│   ├── semantic_diff.py          # Gate 3 — function preservation
│   ├── ratchet.py                # Gate 2 — effect-size decision
│   └── allocate.py               # bandit budget allocation
├── templates/
│   ├── test-prompts.template.json
│   ├── playbook.template.md
│   └── result-card.html
├── examples/
│   ├── example-walkthrough.md
│   └── example-results.tsv
└── docs/
    ├── philosophy.md
    └── migrating-from-darwin.md

Every Python script runs standalone with clear I/O; the skill drives them through sub-agents, and you can also run them yourself.


Meta-Metrics — How You Know This Is Working

Over a full optimization run, healthy signals are:

  • < 20% of candidate changes rejected by Gate 2 (effect size too small) — anchors give useful leverage
  • < 5% rejected by Gate 3 (function drift) — the editor is disciplined
  • ≥ 90% direction agreement between training delta and holdout delta — no significant overfitting
  • Playbook patterns reused later produce positive Δ — crosstalk works

These meta-metrics are tracked in results.tsv over time. They tell you whether the optimizer itself is healthy, not just whether individual optimizations worked.
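A sketch of how these aggregates could be computed from the log (column and verdict names are assumptions; examples/example-results.tsv defines the real schema):

import csv

def meta_metrics(path: str) -> dict[str, float]:
    """Health signals for the optimizer itself, aggregated over all rounds."""
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    n = max(len(rows), 1)
    kept = [r for r in rows if r["verdict"] == "keep"]
    agree = sum(
        (float(r["train_delta"]) > 0) == (float(r["holdout_delta"]) > 0) for r in kept
    )
    return {
        "gate2_reject_rate": sum(r["verdict"] == "gate2_reject" for r in rows) / n,
        "gate3_reject_rate": sum(r["verdict"] == "gate3_reject" for r in rows) / n,
        "direction_agreement": agree / max(len(kept), 1),
    }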


How This Differs from darwin-skill

Darwin is the direct ancestor and gets credit for mapping autoresearch onto SKILL.md optimization. evolve-skill keeps the backbone (ratchet, dual evaluation, human-in-loop) and adds measurement discipline:

| | darwin-skill | evolve-skill |
|---|---|---|
| Rubric | "Score 1–10 per dimension" | Anchored 1/5/10 per dimension |
| Measurement noise | Unmeasured | Gate 1: three-run SD ≤ 2.0 |
| Keep threshold | new > old | Δ ≥ max(3, 2×SD) |
| Overfitting protection | None | Train/holdout split |
| Function drift | Rule, not measurement | Gate 3: semantic diff ≥ 80% |
| Budget allocation | Fixed 3 rounds | Multi-armed bandit |
| Failed experiments | Git-reverted into main | Preserved on feature branches |
| Learning across skills | None | Playbook crosstalk |
| Dry-run scores | Used for decisions | confidence=low blocks keep decisions |

See docs/migrating-from-darwin.md for a migration guide.


Design Inspiration

  • Andrej Karpathy — autoresearch: autonomous experiment loops with a measurable objective
  • 花叔 / alchaincyf — darwin-skill: the original SKILL.md adaptation
  • Inter-rater agreement research: anchored rubrics produce tighter scores than open scales
  • Train/val/test discipline from ML: holdouts are not optional

The contribution here is not novelty. It's measurement discipline: the loss function itself must be validated, not assumed.


License

MIT. Fork, modify, ship. Attribution appreciated.
