Autonomous experiment loops for Claude Code.
Edit code → benchmark → keep or revert → repeat. Forever. While you sleep.
Inspired by karpathy/autoresearch, but domain-agnostic. Optimize anything with a number — test speed, bundle size, ML training loss, build time, Lighthouse scores. Claude runs hundreds of experiments autonomously, committing what works and reverting what doesn't.
```
┌─────────────────────────────────────────────────┐
│                autoresearch loop                │
│                                                 │
│  ┌──────────┐    ┌───────────┐    ┌─────────┐   │
│  │ Analyze  │───▶│   Edit    │───▶│   Run   │   │
│  │   code   │    │  1 idea   │    │  bench  │   │
│  └──────────┘    └───────────┘    └────┬────┘   │
│        ▲                               │        │
│        │                               ▼        │
│  ┌────┴─────┐                     ┌─────────┐   │
│  │  Log to  │◀────────────────────│ Better? │   │
│  │  JSONL   │                     └────┬────┘   │
│  └──────────┘                          │        │
│        │                               │        │
│        ▼                               ▼        │
│  ┌─────────┐                     ┌──────────┐   │
│  │  Keep   │     git commit      │ Discard  │   │
│  │ (commit)│                     │ (revert) │   │
│  └─────────┘                     └──────────┘   │
│                                                 │
│                   NEVER STOP                    │
└─────────────────────────────────────────────────┘
```
```shell
# Clone and copy the skill into your project
git clone https://github.com/dnh33/autoresearch.git /tmp/autoresearch
cp -r /tmp/autoresearch/.claude/skills/autoresearch .claude/skills/
cp /tmp/autoresearch/.claude/commands/autoresearch.md .claude/commands/
cp /tmp/autoresearch/.claude/settings.json .claude/settings.json
cp /tmp/autoresearch/CLAUDE.md ./CLAUDE.md
```

Then merge `settings.json` if you already have one — you just need the SessionStart hook.
Or try it on the repo itself:

```shell
git clone https://github.com/dnh33/autoresearch.git my-project
cd my-project
```

In Claude Code:

```
/autoresearch "optimize test speed"
```
That's it. Claude will:
- Ask a few questions about your setup (or infer from context)
- Create `autoresearch.md` (session protocol) and `autoresearch.sh` (benchmark script)
- Run a baseline measurement
- Start looping — making changes, benchmarking, keeping improvements, reverting failures
- Never stop until you interrupt
Claude runs an infinite experiment loop:
- Analyze — Read the source, understand bottlenecks, form a hypothesis
- Edit — Make a focused change (prefer single-file edits)
- Benchmark — Run `autoresearch.sh`, parse `METRIC name=value` output
- Evaluate — Compare against baseline. Better? Keep. Worse? Discard.
- Log — Append the result to `autoresearch.jsonl` (survives git reverts)
- Commit or revert — `git commit` on keep, `git checkout -- .` on discard
- GOTO 1
Context window full? Claude gets compacted? No problem.
A SessionStart hook automatically injects experiment progress into every new session. The fresh agent reads autoresearch.md + autoresearch.jsonl and picks up exactly where the last one left off. Your optimization runs can span hours across many context resets.
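The hook itself is ordinary Claude Code configuration. The repo ships its own `settings.json`; treat this as a rough sketch of the mechanism, not the exact file:

```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash .claude/skills/autoresearch/scripts/session-context.sh"
          }
        ]
      }
    ]
  }
}
```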
Need tests to keep passing while you optimize? Create autoresearch.checks.sh:
```shell
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
```

It runs after every passing benchmark. If checks fail, the result cannot be kept — the change gets reverted. This gives you backpressure against optimizations that break things.
```shell
# Claude runs this periodically, or you can ask for it
bash .claude/skills/autoresearch/scripts/dashboard.sh
```

This opens an interactive HTML dashboard in your browser with:
- Stats cards (total runs, kept, discarded, crashed)
- Progress chart (metric over time, color-coded by status)
- Full results table with commit hashes and descriptions
Complex optimizations that aren't worth pursuing right now get saved to autoresearch.ideas.md. When Claude resumes after a context reset, it checks the backlog and experiments with deferred ideas.
```
your-project/
├── .claude/
│   ├── commands/
│   │   └── autoresearch.md        # /autoresearch slash command
│   ├── settings.json              # SessionStart hook for persistence
│   └── skills/
│       └── autoresearch/
│           ├── SKILL.md           # Skill definition + protocol
│           └── scripts/
│               ├── init-experiment.sh    # Initialize session config
│               ├── run-experiment.sh     # Timed benchmark execution
│               ├── log-experiment.sh     # Record result + git commit/revert
│               ├── dashboard.sh          # Interactive HTML dashboard
│               ├── summary.sh            # Terminal progress summary
│               └── session-context.sh    # SessionStart hook script
├── CLAUDE.md                      # Tells Claude about autoresearch mode
├── autoresearch.md                # Living session doc (created per-project)
├── autoresearch.sh                # Benchmark script (created per-project)
├── autoresearch.checks.sh         # Optional correctness checks
├── autoresearch.ideas.md          # Deferred optimization ideas
└── autoresearch.jsonl             # Append-only experiment log
```
The root-level `autoresearch.*` files are created by Claude when you start a session. The `.claude/` directory is the skill — everything else is generated.
| Domain | Metric | Direction | Benchmark command |
|---|---|---|---|
| Test speed | `seconds` | lower ↓ | `pnpm test --run` |
| Bundle size | `size_kb` | lower ↓ | `pnpm build && du -sb dist` |
| ML training | `val_bpb` | lower ↓ | `uv run train.py` |
| Build speed | `seconds` | lower ↓ | `pnpm build` |
| Lighthouse | `perf_score` | higher ↑ | `lighthouse http://localhost:3000 --output=json` |
| API latency | `p99_ms` | lower ↓ | `k6 run load-test.js` |
| Memory usage | `rss_mb` | lower ↓ | `node --max-old-space-size=512 app.js` |
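For the test-speed case, the benchmark script can be little more than a timer around the command. This is a hedged sketch, not the shipped `autoresearch.sh`; `BENCH_CMD` is an illustrative variable name:

```shell
#!/bin/bash
# Hypothetical autoresearch.sh for a test-speed run: time the command and
# emit the METRIC line the loop parses. BENCH_CMD is illustrative; set it
# to your own runner.
set -euo pipefail
BENCH_CMD="${BENCH_CMD:-pnpm test --run}"

start=$(date +%s.%N)
# In real use you would want a failing test run to fail the benchmark,
# not be swallowed like this.
$BENCH_CMD >/dev/null 2>&1 || true
end=$(date +%s.%N)

awk -v s="$start" -v e="$end" 'BEGIN { printf "METRIC seconds=%.2f\n", e - s }'
```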
Same idea, different scope:
| | karpathy/autoresearch | autoresearch (this) |
|---|---|---|
| Domain | ML training (`val_bpb`) | Anything with a metric |
| Agent | Claude / Cursor / Codex | Claude Code |
| Files to modify | `train.py` only | Any files in scope |
| Session persistence | None — agent re-reads source | SessionStart hook + JSONL history |
| Correctness checks | None | `autoresearch.checks.sh` |
| Results format | TSV (manual append) | JSONL (automatic, structured) |
| Visualization | Post-hoc Jupyter notebook | Live HTML dashboard |
| Ideas backlog | None | `autoresearch.ideas.md` |
| Platform | Linux + CUDA GPU | Any OS (Windows, Mac, Linux) |
Karpathy's version proves the concept with ML training. This version makes it work for any optimization target, on any platform, with session persistence built in.
`autoresearch.md` — created by Claude at setup. Contains the objective, metrics, files in scope, constraints, experiment protocol, decision rules, and a "What's Been Tried" section that gets updated as experiments run. A fresh agent with no context should be able to read this file alone and continue the loop.
`autoresearch.sh` — your benchmark. Must output `METRIC name=value` lines:

```shell
#!/bin/bash
set -euo pipefail
pnpm build 2>&1 | tail -5
du -sb dist | awk '{print "METRIC size_kb=" int($1 / 1024)}'
```

`autoresearch.jsonl` — append-only, gitignored (survives `git checkout` reverts). Two entry types:
Built into the protocol — Claude follows these automatically:
- Primary metric is king. Improved → keep. Worse or equal → discard.
- Simpler code for equal perf = always keep. Removing code is a win.
- Ugly complexity for tiny gain = probably discard.
- Don't thrash. Repeated reverts on same idea → try something structurally different.
- Think longer when stuck. Deep understanding > random variations.
- Crashes: fix if trivial, log and move on if not.
MIT