Karpathy-style experiment framework for Atris.
This repo defines the schema, validation rules, and benchmark harness for self-improvement loops.
Live experiment packs belong inside product repos at atris/experiments/.
An experiment is not "the agent rewrote its prompt and said it improved."
An experiment is:
- one bounded target
- one external metric
- one keep/revert loop
- one append-only log
If the metric goes up, keep the change. If it does not, revert it.
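The keep/revert rule can be sketched as a single function (illustrative only; `measure` and `proposal` here are stand-ins for a pack's `measure.py` and a proposal patch, not this repo's actual API):

```python
def keep_or_revert(target, proposal, measure):
    """Apply a proposal to the target; keep it only if the external
    metric strictly improves, otherwise revert to the old target."""
    baseline = measure(target)
    candidate = proposal(target)
    if measure(candidate) > baseline:
        return candidate, "KEEP"
    return target, "REVERT"
```

The strict `>` matters: a proposal that merely matches the baseline is reverted, which keeps the loop from accumulating neutral churn.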
atris/experiments/
├── README.md
├── validate.py
├── benchmark_validate.py
├── benchmark_runtime.py
└── <experiment-slug>/
├── program.md
├── measure.py
├── loop.py
├── results.tsv
├── reset.py # preferred
├── proposals/ # optional
└── <bounded-target> # candidate.py, system_prompt.txt, etc.
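For the append-only `results.tsv`, a minimal logging helper might look like this (the column layout below is an assumption for illustration, not the repo's schema):

```python
import csv
import time

def append_result(path, proposal_name, score, decision):
    """Append one row to a results.tsv-style log.

    Open in append mode only -- existing rows are never rewritten,
    which is what makes the log trustworthy as an experiment record.
    """
    with open(path, "a", newline="") as f:
        csv.writer(f, delimiter="\t").writerow(
            [time.strftime("%Y-%m-%dT%H:%M:%S"), proposal_name, f"{score:.3f}", decision]
        )
```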
- One bounded mutation target per experiment.
- `measure.py` must use an external metric the agent cannot fake.
- `loop.py` must keep only improvements and revert regressions.
- `program.md` stays short and task-specific.
- `results.tsv` stays append-only.
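As a sketch of what "an external metric the agent cannot fake" can look like, a `measure.py` might score output against a fixed expectation that lives outside the bounded target (the word-count test here mirrors the smoke example; the exact scoring curve is an assumption):

```python
def score_word_count(text, expected=5):
    """Score 1.0 for an exact word-count match, decaying toward 0.0.

    The expectation is pinned in measure.py, outside the mutation
    target, so the agent cannot edit it to inflate its own score.
    """
    words = len(text.split())
    return max(0.0, 1.0 - abs(words - expected) / expected)
```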
- `template/pack/` - starter files for a new experiment
- `validate.py` - structural and bloat checks
- `benchmark_validate.py` - validator benchmark on fixed good/bad fixtures
- `benchmark_runtime.py` - runtime benchmark on example packs
- `examples/` - tiny reference implementation
Start with the smallest honest pack:
examples/smoke-keep-revert/
├── candidate.py
├── measure.py
├── loop.py
├── reset.py
├── results.tsv
└── proposals/
├── bad_patch.py
└── fix_patch.py
What it does:
- `candidate.py` starts broken on purpose
- `measure.py` scores it on a fixed word-count test
- `bad_patch.py` makes it worse
- `fix_patch.py` actually fixes it
- `loop.py` keeps only the fix
Run it:
python examples/smoke-keep-revert/reset.py
python examples/smoke-keep-revert/loop.py \
  --proposal examples/smoke-keep-revert/proposals/bad_patch.py \
  --proposal examples/smoke-keep-revert/proposals/fix_patch.py

Visual:
broken target
↓
score = 0.2
↓
bad patch
↓
score = 0.0
↓
REVERT
↓
good patch
↓
score = 1.0
↓
KEEP
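The flow above can be reproduced in miniature (a self-contained toy, not the real pack: the target is a string, patches are plain functions, and the scores are chosen to match the diagram):

```python
target = "too few"                       # broken on purpose (scores 0.2)

def measure(t):
    # Fixed word-count test: 5 words is perfect, 2 is the broken
    # baseline, anything else scores zero.
    return {5: 1.0, 2: 0.2}.get(len(t.split()), 0.0)

def bad_patch(t):
    return ""                            # makes it worse (scores 0.0)

def fix_patch(t):
    return "one two three four five"     # actually fixes it (scores 1.0)

log = []
for patch in (bad_patch, fix_patch):
    candidate = patch(target)
    if measure(candidate) > measure(target):
        target, decision = candidate, "KEEP"
    else:
        decision = "REVERT"
    log.append((patch.__name__, measure(candidate), decision))
```

After running, `log` records the bad patch as a REVERT and the fix as a KEEP, and `target` ends at score 1.0, exactly as in the diagram.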
Validate and benchmark:
python validate.py examples
python benchmark_validate.py
python benchmark_runtime.py