Merged
101 changes: 101 additions & 0 deletions .claude/skills/mobius-evolve/SKILL.md
@@ -0,0 +1,101 @@
---
name: mobius-evolve
description: Use when the user says "evolve", "mobius evolve", "improve agents", or wants to refine underperforming agents without API costs.
user-invocable: true
argument-hint: "[specialization] [--threshold 0.4] [--min-matches 3]"
---

# Mobius Evolve (Local Opus — Evaluator-Optimizer Loop)

**You ARE the evaluator-optimizer.** You are Claude Opus running locally — this costs $0 in API calls. Instead of calling the API to refine agents, YOU analyze the judge feedback, critique the current system prompt, and craft an improved version using the agentic-eval reflection pattern.

## Why this matters

The `mobius evolve` CLI command calls the Opus API to refine agents ($$$). But you (Claude Code Opus) are already running and are the same model. So:
- **This skill** = Opus refining agents for FREE (Pro subscription) with multi-pass self-critique
- **mobius evolve** = Opus refining agents for $$$ (API calls), single or multi-pass
- Same quality, zero cost, and you have full conversation context.

## What to do

### Step 1: Load underperformers

```bash
python .claude/skills/mobius-evolve/scripts/load_underperformers.py [specialization] [--threshold 0.4] [--min-matches 3]
```

This shows agents with low win rates, their current system prompts, and judge feedback from their losses.

### Step 2: Analyze and refine (YOU are the evaluator-optimizer loop)

For each underperformer, apply the **reflection pattern**:

**Evaluate:** Read the agent's current system prompt and the judge feedback. Identify the specific weaknesses the judges called out.

**Critique:** Ask yourself:
- What specific failure patterns does the feedback reveal?
- Is the system prompt too generic? Too narrow? Missing edge cases?
- Does it lack clear quality criteria or output format expectations?
- Would a different problem-solving approach help?

**Refine:** Write an improved system prompt that directly addresses each criticism. Be substantive — cosmetic rewording doesn't help.

**Self-check:** Before registering, verify your refinement:
- Does it address EVERY piece of judge feedback?
- Is it specific and opinionated (not generic)?
- Does it include quality criteria, methodology, and output format?
- Is it meaningfully different from the original, not just reworded?

If your self-check fails, iterate — refine again before registering.
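
The self-check can be sketched as a Python heuristic. This is illustrative only, not part of mobius: the criteria names and thresholds are assumptions, and the real check is your own judgment about the prompt.

```python
def self_check(original: str, refined: str, feedback: list[str]) -> list[str]:
    """Return self-check failures for a refined prompt; an empty list means it passes.

    Illustrative heuristics only; your own reading of the prompt is the real check.
    """
    failures = []
    # Meaningfully different, not just identical or lightly reworded
    if refined.strip() == original.strip():
        failures.append("refinement is identical to the original prompt")
    # Rough proxy for "addresses every piece of judge feedback": each
    # criticism should leave some visible trace in the new prompt
    for point in feedback:
        keywords = [w for w in point.lower().split() if len(w) > 5]
        if keywords and not any(w in refined.lower() for w in keywords):
            failures.append(f"feedback not visibly addressed: {point!r}")
    # A substantive refinement rarely shrinks the prompt
    if len(refined) < len(original):
        failures.append("refined prompt is shorter than the original")
    return failures
```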

### Step 3: Register the improved agent

Use the create_agent script to register the evolved version:

```bash
python .claude/skills/mobius-seed/scripts/create_agent.py '{
  "name": "Agent Name v2",
  "slug": "original-slug-v2",
  "description": "Updated description",
  "system_prompt": "Your improved system prompt here...",
  "provider": "anthropic",
  "model": "claude-haiku-4-5-20251001",
  "tools": ["Read", "Grep", "Glob", "Bash", "Write", "Edit"],
  "specializations": ["coding", "refactoring"],
  "is_champion": false
}'
```

Important:
- Set `is_champion` to `false` — the evolved agent must earn its rank through competition
- Keep the same provider/model unless there's a strong reason to change
- Use a slug like `original-slug-v2` or `original-slug-gen2` to show lineage
- Keep the same specializations
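
A tiny helper for the lineage convention, sketched in Python. The `-vN` suffix pattern is an assumption drawn from the examples above, not a mobius API:

```python
import re

def next_version_slug(slug: str) -> str:
    """code-reviewer -> code-reviewer-v2; code-reviewer-v2 -> code-reviewer-v3."""
    m = re.match(r"^(.*)-v(\d+)$", slug)
    if m:
        # Already versioned: bump the numeric suffix
        return f"{m.group(1)}-v{int(m.group(2)) + 1}"
    # First evolution: start the lineage at v2
    return f"{slug}-v2"
```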

### Step 4: Show results

```bash
python -m mobius.cli agent list
```

## The Agentic-Eval Pattern You're Running

```
Load underperformers → Read judge feedback → Critique prompt
         ↑                                         ↓
         └──── self-check fails ← Self-check ← Refine prompt
                                      ↓ passes
                                Register agent
```

This is the Evaluator-Optimizer pattern from agentic-eval, with YOU as both evaluator and optimizer. The key advantage over the CLI version: you can reason about the feedback in full context, consider the agent's match history, and make nuanced improvements that a single API call might miss.
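
The loop can be sketched as Python for clarity. The `critique`, `refine`, and `passes` callables stand in for your own reasoning steps; nothing here is a mobius API:

```python
from typing import Callable

def evolve(prompt: str,
           feedback: list[str],
           critique: Callable[[str, list[str]], str],
           refine: Callable[[str, str], str],
           passes: Callable[[str], bool],
           max_rounds: int = 3) -> str:
    """Evaluator-optimizer loop: critique, refine, and self-check until the
    refinement passes or the round budget runs out."""
    for _ in range(max_rounds):
        note = critique(prompt, feedback)  # evaluate: what is weak and why?
        prompt = refine(prompt, note)      # optimize: rewrite to address it
        if passes(prompt):                 # self-check gate before registering
            break
    return prompt
```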

## Pro Tips

- **Don't just reword** — change the agent's methodology, add specific techniques, restructure the approach
- **Study the winners** — if the judge praised a winning agent's approach, consider incorporating similar strategies
- **Vary approaches** — if an agent keeps losing with approach X, try a fundamentally different approach Y
- **Keep prompts focused** — under 2000 tokens but dense with specific guidance
- **Consider the model** — Haiku benefits from very explicit instructions; Gemini Flash may need different prompt styles
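
For the "under 2000 tokens" tip, a rough length check. The 4-characters-per-token ratio is a common rule of thumb for English prose, not an exact tokenizer:

```python
def rough_token_count(text: str) -> int:
    """Very rough token estimate: about 4 characters per token for English prose."""
    return max(1, len(text) // 4)

def prompt_within_budget(prompt: str, budget: int = 2000) -> bool:
    """True when the prompt's rough token estimate fits the budget."""
    return rough_token_count(prompt) <= budget
```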
119 changes: 119 additions & 0 deletions .claude/skills/mobius-evolve/scripts/load_underperformers.py
@@ -0,0 +1,119 @@
"""Load underperforming agents with their loss feedback for evolution.

Usage:
    python load_underperformers.py [specialization] [--threshold 0.4] [--min-matches 3]

Outputs agent details, win rates, and judge feedback from their losses
so that Opus can craft improved system prompts.
"""

import sys

sys.path.insert(0, "src")  # allow running from the repo root without installing mobius

from mobius.config import get_config
from mobius.db import init_db
from mobius.registry import Registry
from mobius.tournament import Tournament


def _win_rate_excluding_voided(tournament, agent_id: str, window: int) -> tuple[float, int]:
    """Calculate win rate excluding voided matches. Returns (rate, valid_count)."""
    matches = tournament.get_agent_matches(agent_id, limit=window)
    valid = [m for m in matches if not m.voided]
    if not valid:
        return 0.0, 0
    wins = sum(1 for m in valid if m.winner_id == agent_id)
    return wins / len(valid), len(valid)


def main():
    args = sys.argv[1:]

    # Parse flags
    specialization = None
    threshold = 0.4
    min_matches = 3

    i = 0
    while i < len(args):
        if args[i] == "--threshold" and i + 1 < len(args):
            threshold = float(args[i + 1])
            i += 2
        elif args[i] == "--min-matches" and i + 1 < len(args):
            min_matches = int(args[i + 1])
            i += 2
        elif not args[i].startswith("--"):
            specialization = args[i]
            i += 1
        else:
            i += 1  # silently skip unrecognized flags

    config = get_config()
    conn, _ = init_db(config)
    registry = Registry(conn, config)
    tournament = Tournament(conn, config, registry)

    agents = registry.list_agents(specialization=specialization)
    if not agents:
        print(f"No agents found{' for ' + specialization if specialization else ''}.")
        sys.exit(1)

    underperformers = []
    for agent in agents:
        win_rate, valid_count = _win_rate_excluding_voided(
            tournament, agent.id, config.underperformer_window
        )
        if valid_count < min_matches:
            continue
        if win_rate < threshold:
            underperformers.append((agent, win_rate))

    if not underperformers:
        print(f"No underperformers below {threshold:.0%} win rate (min {min_matches} matches).")
        print("\nAll agents:")
        for agent in agents:
            wr, _ = _win_rate_excluding_voided(
                tournament, agent.id, config.underperformer_window
            )
            print(f"  {agent.name} ({agent.slug}) — {wr:.0%} win rate, {agent.total_matches} matches")
        sys.exit(0)

    print(f"UNDERPERFORMERS (below {threshold:.0%} win rate, min {min_matches} matches)")
    print(f"Found: {len(underperformers)}")
    print()

    for agent, win_rate in underperformers:
        matches = tournament.get_agent_matches(agent.id, limit=10)
        losses = [
            m for m in matches
            if m.winner_id is not None and m.winner_id != agent.id and not m.voided
        ]

        print(f"{'='*70}")
        print(f"AGENT: {agent.name}")
        print(f"  Slug: {agent.slug}")
        print(f"  ID: {agent.id}")
        print(f"  Provider: {agent.provider}/{agent.model}")
        print(f"  Win Rate: {win_rate:.0%} (Elo: {agent.elo_rating:.0f})")
        print(f"  Generation: {agent.generation}")
        print(f"  Specializations: {', '.join(agent.specializations)}")
        print(f"  Losses: {len(losses)}")
        print()
        print("  CURRENT SYSTEM PROMPT:")
        print(f"  {agent.system_prompt}")
        print()

        if losses:
            print("  JUDGE FEEDBACK FROM LOSSES:")
            for j, m in enumerate(losses[:5], 1):
                print(f"  --- Loss {j} ---")
                print(f"  Task: {m.task_description[:150]}")
                if m.judge_reasoning:
                    print(f"  Judge: {m.judge_reasoning[:300]}")
                print()

        print()

    conn.close()


if __name__ == "__main__":
    main()