Autonomous experiment loops for Claude Code.
Edit code → benchmark → keep or revert → repeat. Forever. While you sleep.
Inspired by karpathy/autoresearch, but domain-agnostic. Optimize anything with a number — test speed, bundle size, ML training loss, build time, Lighthouse scores. Claude runs hundreds of experiments autonomously, committing what works and reverting what doesn't.
```
┌─────────────────────────────────────────────────┐
│                autoresearch loop                │
│                                                 │
│  ┌──────────┐    ┌───────────┐    ┌─────────┐   │
│  │ Analyze  │───▶│   Edit    │───▶│   Run   │   │
│  │   code   │    │  1 idea   │    │  bench  │   │
│  └──────────┘    └───────────┘    └────┬────┘   │
│        ▲                               │        │
│        │                               ▼        │
│  ┌────┴─────┐                     ┌─────────┐   │
│  │  Log to  │◀────────────────────│ Better? │   │
│  │  JSONL   │                     └────┬────┘   │
│  └──────────┘                          │        │
│        │                               │        │
│        ▼                               ▼        │
│  ┌─────────┐                     ┌──────────┐   │
│  │  Keep   │     git commit      │ Discard  │   │
│  │ (commit)│                     │ (revert) │   │
│  └─────────┘                     └──────────┘   │
│                                                 │
│                   NEVER STOP                    │
└─────────────────────────────────────────────────┘
```
```shell
# Clone and copy the skill into your project
git clone https://github.com/dnh33/autoresearch.git /tmp/autoresearch
cp -r /tmp/autoresearch/.claude/skills/autoresearch .claude/skills/
cp /tmp/autoresearch/.claude/commands/autoresearch.md .claude/commands/
cp /tmp/autoresearch/.claude/settings.json .claude/settings.json
cp /tmp/autoresearch/CLAUDE.md ./CLAUDE.md
```

Then merge `settings.json` if you already have one — you just need the SessionStart hook.
Or try it on the repo itself:

```shell
git clone https://github.com/dnh33/autoresearch.git my-project
cd my-project
```

In Claude Code:

```
/autoresearch "optimize test speed"
```
That's it. Claude will:
- Ask a few questions about your setup (or infer from context)
- Create `autoresearch.md` (session protocol) and `autoresearch.sh` (benchmark script)
- Run a baseline measurement
- Start looping — making changes, benchmarking, keeping improvements, reverting failures
- Never stop until you interrupt
Claude runs an infinite experiment loop:
- Analyze — Read the source, understand bottlenecks, form a hypothesis
- Edit — Make a focused change (prefer single-file edits)
- Benchmark — Run `autoresearch.sh`, parse `METRIC name=value` output
- Evaluate — Compare against baseline. Better? Keep. Worse? Discard.
- Log — Append the result to `autoresearch.jsonl` (survives git reverts)
- Commit or revert — `git commit` on keep, `git checkout -- .` on discard
- GOTO 1
Context window full? Claude gets compacted? No problem.
A SessionStart hook automatically injects experiment progress into every new session. The fresh agent reads autoresearch.md + autoresearch.jsonl and picks up exactly where the last one left off. Your optimization runs can span hours across many context resets.
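The hook itself is ordinary Claude Code configuration. The repo ships its own `settings.json`; treat this as a rough sketch of the mechanism, not the exact file:

```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash .claude/skills/autoresearch/scripts/session-context.sh"
          }
        ]
      }
    ]
  }
}
```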
Need tests to keep passing while you optimize? Create autoresearch.checks.sh:
```shell
#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50
```

It runs after every passing benchmark. If checks fail, the result cannot be kept — the change gets reverted. This gives you backpressure against optimizations that break things.
```shell
# Claude runs this periodically, or you can ask for it
bash .claude/skills/autoresearch/scripts/dashboard.sh
```

This opens an interactive HTML dashboard in your browser with:
- Stats cards (total runs, kept, discarded, crashed)
- Progress chart (metric over time, color-coded by status)
- Full results table with commit hashes and descriptions
Complex optimizations that aren't worth pursuing right now get saved to autoresearch.ideas.md. When Claude resumes after a context reset, it checks the backlog and experiments with deferred ideas.
```
your-project/
├── .claude/
│   ├── commands/
│   │   └── autoresearch.md        # /autoresearch slash command
│   ├── settings.json              # SessionStart hook for persistence
│   └── skills/
│       └── autoresearch/
│           ├── SKILL.md           # Skill definition + protocol
│           └── scripts/
│               ├── init-experiment.sh    # Initialize session config
│               ├── run-experiment.sh     # Timed benchmark execution
│               ├── log-experiment.sh     # Record result + git commit/revert
│               ├── dashboard.sh          # Interactive HTML dashboard
│               ├── summary.sh            # Terminal progress summary
│               └── session-context.sh    # SessionStart hook script
├── CLAUDE.md                      # Tells Claude about autoresearch mode
├── autoresearch.md                # Living session doc (created per-project)
├── autoresearch.sh                # Benchmark script (created per-project)
├── autoresearch.checks.sh         # Optional correctness checks
├── autoresearch.ideas.md          # Deferred optimization ideas
└── autoresearch.jsonl             # Append-only experiment log
```
The root-level `autoresearch.*` files are created by Claude when you start a session. The `.claude/` directory is the skill — everything else is generated.
| Domain | Metric | Direction | Benchmark command |
|---|---|---|---|
| Test speed | `seconds` | lower ↓ | `pnpm test --run` |
| Bundle size | `size_kb` | lower ↓ | `pnpm build && du -sb dist` |
| ML training | `val_bpb` | lower ↓ | `uv run train.py` |
| Build speed | `seconds` | lower ↓ | `pnpm build` |
| Lighthouse | `perf_score` | higher ↑ | `lighthouse http://localhost:3000 --output=json` |
| API latency | `p99_ms` | lower ↓ | `k6 run load-test.js` |
| Memory usage | `rss_mb` | lower ↓ | `node --max-old-space-size=512 app.js` |
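For the test-speed case, the benchmark script can be little more than a timer around the command. This is a hedged sketch, not the shipped `autoresearch.sh`; `BENCH_CMD` is an illustrative variable name:

```shell
#!/bin/bash
# Hypothetical autoresearch.sh for a test-speed run: time the command and
# emit the METRIC line the loop parses. BENCH_CMD is illustrative; set it
# to your own runner.
set -euo pipefail
BENCH_CMD="${BENCH_CMD:-pnpm test --run}"

start=$(date +%s.%N)
# In real use you would want a failing test run to fail the benchmark,
# not be swallowed like this.
$BENCH_CMD >/dev/null 2>&1 || true
end=$(date +%s.%N)

awk -v s="$start" -v e="$end" 'BEGIN { printf "METRIC seconds=%.2f\n", e - s }'
```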
Same idea, different scope:
| | karpathy/autoresearch | autoresearch (this) |
|---|---|---|
| Domain | ML training (`val_bpb`) | Anything with a metric |
| Agent | Claude / Cursor / Codex | Claude Code |
| Files to modify | `train.py` only | Any files in scope |
| Session persistence | None — agent re-reads source | SessionStart hook + JSONL history |
| Correctness checks | None | `autoresearch.checks.sh` |
| Results format | TSV (manual append) | JSONL (automatic, structured) |
| Visualization | Post-hoc Jupyter notebook | Live HTML dashboard |
| Ideas backlog | None | `autoresearch.ideas.md` |
| Platform | Linux + CUDA GPU | Any OS (Windows, Mac, Linux) |
Karpathy's version proves the concept with ML training. This version makes it work for any optimization target, on any platform, with session persistence built in.
`autoresearch.md` — created by Claude at setup. Contains the objective, metrics, files in scope, constraints, experiment protocol, decision rules, and a "What's Been Tried" section that gets updated as experiments run. A fresh agent with no context should be able to read this file alone and continue the loop.
`autoresearch.sh` — your benchmark. Must output `METRIC name=value` lines:

```shell
#!/bin/bash
set -euo pipefail
pnpm build 2>&1 | tail -5
du -sb dist | awk '{print "METRIC size_kb=" int($1 / 1024)}'
```

`autoresearch.jsonl` — append-only, gitignored (survives `git checkout` reverts). Two entry types:
Built into the protocol — Claude follows these automatically:
- Primary metric is king. Improved → keep. Worse or equal → discard.
- Simpler code for equal perf = always keep. Removing code is a win.
- Ugly complexity for tiny gain = probably discard.
- Don't thrash. Repeated reverts on same idea → try something structurally different.
- Think longer when stuck. Deep understanding > random variations.
- Crashes: fix if trivial, log and move on if not.
MIT