
autoresearch

Autonomous experiment loops for Claude Code.

Edit code → benchmark → keep or revert → repeat. Forever. While you sleep.

Inspired by karpathy/autoresearch, but domain-agnostic. Optimize anything with a number — test speed, bundle size, ML training loss, build time, Lighthouse scores. Claude runs hundreds of experiments autonomously, committing what works and reverting what doesn't.

    ┌───────────────────────────────────────────────┐
    │              autoresearch loop                │
    │                                               │
    │   ┌─────────┐    ┌─────────┐    ┌─────────┐   │
    │   │ Analyze │───▶│  Edit   │───▶│   Run   │   │
    │   │  code   │    │ 1 idea  │    │  bench  │   │
    │   └─────────┘    └─────────┘    └────┬────┘   │
    │        ▲                             │        │
    │        │                             ▼        │
    │   ┌────┴────┐                   ┌─────────┐   │
    │   │ Log to  │◀──────────────────│ Better? │   │
    │   │  JSONL  │                   └────┬────┘   │
    │   └─────────┘                        │        │
    │        │                             │        │
    │        ▼                             ▼        │
    │   ┌─────────┐                   ┌─────────┐   │
    │   │  Keep   │  git commit       │ Discard │   │
    │   │ (commit)│                   │ (revert)│   │
    │   └─────────┘                   └─────────┘   │
    │                                               │
    │                 NEVER STOP                    │
    └───────────────────────────────────────────────┘

Quick Install

Option A: Copy into your project

# Clone and copy the skill into your project
git clone https://github.com/dnh33/autoresearch.git /tmp/autoresearch
cp -r /tmp/autoresearch/.claude/skills/autoresearch .claude/skills/
cp /tmp/autoresearch/.claude/commands/autoresearch.md .claude/commands/
cp /tmp/autoresearch/.claude/settings.json .claude/settings.json
cp /tmp/autoresearch/CLAUDE.md ./CLAUDE.md

Then merge settings.json if you already have one — you just need the SessionStart hook.

Option B: Start from this repo

git clone https://github.com/dnh33/autoresearch.git my-project
cd my-project

Quick Start

In Claude Code:

/autoresearch "optimize test speed"

That's it. Claude will:

  1. Ask a few questions about your setup (or infer from context)
  2. Create autoresearch.md (session protocol) and autoresearch.sh (benchmark script)
  3. Run a baseline measurement
  4. Start looping — making changes, benchmarking, keeping improvements, reverting failures
  5. Never stop until you interrupt

How It Works

The Loop

Claude runs an infinite experiment loop:

  1. Analyze — Read the source, understand bottlenecks, form a hypothesis
  2. Edit — Make a focused change (prefer single-file edits)
  3. Benchmark — Run autoresearch.sh, parse METRIC name=value output
  4. Evaluate — Compare against baseline. Better? Keep. Worse? Discard.
  5. Log — Append result to autoresearch.jsonl (survives git reverts)
  6. Commit or revert — git commit on keep, git checkout -- . on discard
  7. GOTO 1

Session Persistence

Context window full? Claude gets compacted? No problem.

A SessionStart hook automatically injects experiment progress into every new session. The fresh agent reads autoresearch.md + autoresearch.jsonl and picks up exactly where the last one left off. Your optimization runs can span hours across many context resets.
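For reference, the hook registration in settings.json looks roughly like this. The schema follows Claude Code's hooks configuration; treat this as a sketch and prefer the copy shipped in the repo:

```json
{
  "hooks": {
    "SessionStart": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "bash .claude/skills/autoresearch/scripts/session-context.sh"
          }
        ]
      }
    ]
  }
}
```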

Correctness Checks

Need tests to keep passing while you optimize? Create autoresearch.checks.sh:

#!/bin/bash
set -euo pipefail
pnpm test --run --reporter=dot 2>&1 | tail -50

It runs after every passing benchmark. If checks fail, the result cannot be kept — the change gets reverted. This gives you backpressure against optimizations that break things.

Live Dashboard

# Claude runs this periodically, or you can ask for it
bash .claude/skills/autoresearch/scripts/dashboard.sh

Opens an interactive HTML dashboard in your browser with:

  • Stats cards (total runs, kept, discarded, crashed)
  • Progress chart (metric over time, color-coded by status)
  • Full results table with commit hashes and descriptions

Ideas Backlog

Complex optimizations that aren't worth pursuing right now get saved to autoresearch.ideas.md. When Claude resumes after a context reset, it checks the backlog and experiments with deferred ideas.

File Structure

your-project/
├── .claude/
│   ├── commands/
│   │   └── autoresearch.md          # /autoresearch slash command
│   ├── settings.json                # SessionStart hook for persistence
│   └── skills/
│       └── autoresearch/
│           ├── SKILL.md             # Skill definition + protocol
│           └── scripts/
│               ├── init-experiment.sh    # Initialize session config
│               ├── run-experiment.sh     # Timed benchmark execution
│               ├── log-experiment.sh     # Record result + git commit/revert
│               ├── dashboard.sh          # Interactive HTML dashboard
│               ├── summary.sh            # Terminal progress summary
│               └── session-context.sh    # SessionStart hook script
├── CLAUDE.md                        # Tells Claude about autoresearch mode
├── autoresearch.md                  # Living session doc (created per-project)
├── autoresearch.sh                  # Benchmark script (created per-project)
├── autoresearch.checks.sh           # Optional correctness checks
├── autoresearch.ideas.md            # Deferred optimization ideas
└── autoresearch.jsonl               # Append-only experiment log

The autoresearch.* files at the project root are created by Claude when you start a session. The .claude/ directory is the skill — everything else is generated.

Example Domains

Domain         Metric       Direction   Benchmark command
Test speed     seconds      lower ↓     pnpm test --run
Bundle size    size_kb      lower ↓     pnpm build && du -sb dist
ML training    val_bpb      lower ↓     uv run train.py
Build speed    seconds      lower ↓     pnpm build
Lighthouse     perf_score   higher ↑    lighthouse http://localhost:3000 --output=json
API latency    p99_ms       lower ↓     k6 run load-test.js
Memory usage   rss_mb       lower ↓     node --max-old-space-size=512 app.js

vs. karpathy/autoresearch

Same idea, different scope:

                     karpathy/autoresearch          autoresearch (this)
Domain               ML training (val_bpb)          Anything with a metric
Agent                Claude / Cursor / Codex        Claude Code
Files to modify      train.py only                  Any files in scope
Session persistence  None (agent re-reads source)   SessionStart hook + JSONL history
Correctness checks   None                           autoresearch.checks.sh
Results format       TSV (manual append)            JSONL (automatic, structured)
Visualization        Post-hoc Jupyter notebook      Live HTML dashboard
Ideas backlog        None                           autoresearch.ideas.md
Platform             Linux + CUDA GPU               Any OS (Windows, Mac, Linux)

Karpathy's version proves the concept with ML training. This version makes it work for any optimization target, on any platform, with session persistence built in.

Configuration

autoresearch.md — Session Protocol

Created by Claude at setup. Contains: objective, metrics, files in scope, constraints, experiment protocol, decision rules, and a "What's Been Tried" section that gets updated as experiments run. A fresh agent with no context should be able to read this file alone and continue the loop.

autoresearch.sh — Benchmark Script

Your benchmark. Must output METRIC name=value lines:

#!/bin/bash
set -euo pipefail
pnpm build 2>&1 | tail -5
du -sb dist | awk '{print "METRIC size_kb=" int($1 / 1024)}'

autoresearch.jsonl — Experiment Log

Append-only, gitignored (survives git checkout reverts). Two entry types:

// Config (written by init-experiment.sh)
{"type":"config","name":"Optimize build","metricName":"size_kb","metricUnit":"KB","bestDirection":"lower","timestamp":1711270800}

// Result (written by log-experiment.sh)
{"run":1,"commit":"abc1234","metric":142.5,"metrics":{},"status":"keep","description":"tree-shake unused utils","timestamp":1711270860,"segment":0}

Decision Rules

Built into the protocol — Claude follows these automatically:

  • Primary metric is king. Improved → keep. Worse or equal → discard.
  • Simpler code for equal perf = always keep. Removing code is a win.
  • Ugly complexity for tiny gain = probably discard.
  • Don't thrash. Repeated reverts on same idea → try something structurally different.
  • Think longer when stuck. Deep understanding > random variations.
  • Crashes: fix if trivial, log and move on if not.

License

MIT
