A terminal-first, file-based research workflow runner for long-form AI-assisted research.
It drives a fixed 8-stage research pipeline, requires human approval after every stage,
and writes every prompt, log, summary, and artifact into an isolated run directory.
Why AutoR · Showcase · Quick Start · How It Works · Run Layout · Architecture · Roadmap
AutoR is not a chat demo, not a generic agent framework, and not a markdown-only research toy.
It is a research execution loop: goal -> literature -> hypothesis -> design -> implementation -> experiments -> analysis -> paper -> dissemination, with explicit human control at every stage and real artifacts on disk.
Most AI research demos stop at "the model wrote a plausible summary."
AutoR is built around a harder standard: the system should leave behind a run directory that another person can inspect, resume, audit, and critique.
| AutoR does | Why it matters |
|---|---|
| Fixed 8-stage research workflow | The system behaves like a real research process instead of a free-form chat loop. |
| Mandatory human approval after every stage | AI executes; humans retain control at high-leverage decision points. |
| Full run isolation under `runs/<run_id>/` | Prompts, logs, stage outputs, code, figures, and papers are all auditable. |
| Draft -> validate -> promote for stage summaries | Half-finished summaries do not silently become official stage records. |
| Artifact-aware validation | Later stages must produce data, results, figures, LaTeX, PDF, and review assets, not just prose. |
| Resume and redo-stage support | Long runs are recoverable and partially repeatable. |
| Stage-local conversation continuation | Refinement improves the current stage instead of constantly resetting context. |
| Venue-aware writing stage | Stage 07 can target lightweight conference or journal-style paper packaging without pretending to be a full submission system. |
- A run is isolated under `runs/<run_id>/`.
- Claude never writes directly to the final stage summary file.
- Human approval is required before the workflow advances.
- Approved summaries are appended to `memory.md`; failed attempts are not.
- Stage 03+ must produce machine-readable data artifacts.
- Stage 05+ must produce machine-readable result artifacts.
- Stage 06+ must produce real figure files.
- Stage 07+ must produce a venue-aware manuscript package with a PDF.
- Stage 08+ must produce review and readiness materials.
AutoR already has a full example run used throughout the repository: runs/20260330_101222.
That run produced:
- a compiled paper PDF: example_paper.pdf
- executable research code
- machine-readable datasets and result files
- real figures used in the paper
- review and dissemination materials
Highlighted outcomes from that run:
- AGSNv2 reached 36.21 ± 1.08 on Actor
- the system produced a full NeurIPS-style paper package
- the final run preserved the full human-in-the-loop approval trail
AutoR is designed for terminal-first execution, but the interaction layer is not limited to raw logs and plain prompts. The current UI supports banner-style startup, colored stage panels, parsed Claude event streams, wrapped markdown summaries, and a menu-driven approval loop suitable for demos and recordings.
Showcase figures: Accuracy Comparison · Ablation + Actor Results · Two-Layer Narrative Figure

Paper preview pages: Page 1 (title, abstract, framing) · Page 5 (method and training algorithm) · Page 7 (main tables and per-seed results)
The example run is interesting not because the AI was left alone, but because the human intervened at critical moments:
- Stage 02 narrowed the project to a single core claim.
- Stage 04 pushed the system to download real datasets and run actual pre-checks.
- Stage 05 forced experimentation to continue until real benchmark results were obtained.
- Stage 06 redirected the story away from leaderboard-only framing toward mechanism-driven analysis.
That is the intended shape of AutoR: AI handles execution load; humans steer the research when direction actually matters.
- Python 3.10+
- Claude CLI available on `PATH` for real runs
- Local TeX tools are helpful for Stage 07, but not required for smoke tests
```bash
# Start an interactive run
python main.py

# Start with an explicit goal
python main.py --goal "Your research goal here"

# Fake-operator smoke test
python main.py --fake-operator --goal "Smoke test"

# Choose the model
python main.py --model sonnet
python main.py --model opus

# Choose the target venue for Stage 07
python main.py --venue neurips_2025
python main.py --venue nature
python main.py --venue jmlr
```

If `--venue` is omitted, AutoR defaults to `neurips_2025`.

```bash
# Resume the most recent run
python main.py --resume-run latest

# Resume a specific run and redo a stage
python main.py --resume-run 20260329_210252 --redo-stage 03
```

Valid stage identifiers include `03`, `3`, and `03_study_design`.
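The accepted `--redo-stage` spellings could be normalized along these lines. This is a hypothetical sketch: the stage names come from the pipeline list, but `normalize_stage` is illustrative, not AutoR's actual code.

```python
# Hypothetical sketch of --redo-stage identifier normalization.
# Stage names mirror the fixed pipeline; normalize_stage is illustrative.

STAGES = [
    "01_literature_survey",
    "02_hypothesis_generation",
    "03_study_design",
    "04_implementation",
    "05_experimentation",
    "06_analysis",
    "07_writing",
    "08_dissemination",
]

def normalize_stage(identifier: str) -> str:
    """Accept '03', '3', or '03_study_design' and return the canonical name."""
    ident = identifier.strip().lower()
    for name in STAGES:
        number = name.split("_", 1)[0]          # e.g. "03"
        if ident in (name, number, number.lstrip("0")):
            return name
    raise ValueError(f"Unknown stage identifier: {identifier!r}")
```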
AutoR uses a fixed 8-stage pipeline:
1. `01_literature_survey`
2. `02_hypothesis_generation`
3. `03_study_design`
4. `04_implementation`
5. `05_experimentation`
6. `06_analysis`
7. `07_writing`
8. `08_dissemination`
```mermaid
flowchart TD
    A[Start or resume run] --> S1[01 Literature Survey]
    S1 --> H1{Human approval}
    H1 -- Refine --> S1
    H1 -- Approve --> S2[02 Hypothesis Generation]
    H1 -- Abort --> X[Abort]
    S2 --> H2{Human approval}
    H2 -- Refine --> S2
    H2 -- Approve --> S3[03 Study Design]
    H2 -- Abort --> X
    S3 --> H3{Human approval}
    H3 -- Refine --> S3
    H3 -- Approve --> S4[04 Implementation]
    H3 -- Abort --> X
    S4 --> H4{Human approval}
    H4 -- Refine --> S4
    H4 -- Approve --> S5[05 Experimentation]
    H4 -- Abort --> X
    S5 --> H5{Human approval}
    H5 -- Refine --> S5
    H5 -- Approve --> S6[06 Analysis]
    H5 -- Abort --> X
    S6 --> H6{Human approval}
    H6 -- Refine --> S6
    H6 -- Approve --> S7[07 Writing]
    H6 -- Abort --> X
    S7 --> H7{Human approval}
    H7 -- Refine --> S7
    H7 -- Approve --> S8[08 Dissemination]
    H7 -- Abort --> X
    S8 --> H8{Human approval}
    H8 -- Refine --> S8
    H8 -- Approve --> Z[Run complete]
    H8 -- Abort --> X
```
```mermaid
flowchart TD
    A[Build prompt from template + goal + memory + optional feedback] --> B[Start or resume stage session]
    B --> C[Claude writes draft stage summary]
    C --> D[Validate markdown and required artifacts]
    D --> E{Valid?}
    E -- No --> F[Repair, normalize, or rerun current stage]
    F --> A
    E -- Yes --> G[Promote draft to final stage summary]
    G --> H{Human choice}
    H -- 1 or 2 or 3 --> I[Continue current stage conversation with AI refinement]
    I --> A
    H -- 4 --> J[Continue current stage conversation with custom feedback]
    J --> A
    H -- 5 --> K[Append approved summary to memory.md]
    K --> L[Continue to next stage]
    H -- 6 --> X[Abort]
```
- 1 / 2 / 3: continue the same stage conversation using one of the AI's refinement suggestions
- 4: continue the same stage conversation with custom user feedback
- 5: approve and continue to the next stage
- 6: abort the run
The stage loop is controlled by AutoR, not by Claude.
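The six-option dispatch described above can be sketched as follows. This is an illustrative sketch, not AutoR's actual implementation; the returned action strings are invented for the example.

```python
# Hypothetical sketch of the six-option approval menu dispatch.
# Action strings are illustrative, not AutoR's internal API.

def handle_choice(choice: int, suggestions: list[str]) -> str:
    """Map a menu choice to the next workflow action."""
    if choice in (1, 2, 3):
        # Continue the same stage conversation with one of the AI's suggestions.
        return f"refine:{suggestions[choice - 1]}"
    if choice == 4:
        # Continue the same stage conversation with custom user feedback.
        return "refine:custom_feedback"
    if choice == 5:
        # Approve: summary is appended to memory.md, then the run advances.
        return "approve_and_continue"
    if choice == 6:
        return "abort"
    raise ValueError(f"Invalid menu choice: {choice}")
```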
AutoR does not consider a run successful just because it generated a plausible markdown summary.
| Stage | Required non-toy output |
|---|---|
| Stage 03+ | Machine-readable data under workspace/data/ |
| Stage 05+ | Machine-readable results under workspace/results/ |
| Stage 06+ | Real figure files under workspace/figures/ |
| Stage 07+ | Venue-aware manuscript sources plus a compiled PDF |
| Stage 08+ | Review and readiness assets under workspace/reviews/ |
Required stage summary shape:

```markdown
# Stage X: <name>
## Objective
## Previously Approved Stage Summaries
## What I Did
## Key Results
## Files Produced
## Suggestions for Refinement
## Your Options
```

Additional rules:
- exactly 3 numbered refinement suggestions
- the fixed 6 user options
- no `[In progress]`, `[Pending]`, `[TODO]`, `[TBD]`, or similar placeholders
- concrete file paths in `Files Produced`
If a run only leaves behind markdown notes, it has not met AutoR's quality bar.
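The shape rules above suggest a simple string-level check. A minimal sketch, assuming a hypothetical validator; the heading list mirrors the required shape, but `summary_problems` is not taken from the AutoR source.

```python
# Hypothetical sketch of the stage-summary shape check described above.
# Headings mirror the required shape; the validator itself is illustrative.

REQUIRED_HEADINGS = [
    "## Objective",
    "## Previously Approved Stage Summaries",
    "## What I Did",
    "## Key Results",
    "## Files Produced",
    "## Suggestions for Refinement",
    "## Your Options",
]
PLACEHOLDERS = ["[In progress]", "[Pending]", "[TODO]", "[TBD]"]

def summary_problems(markdown: str) -> list[str]:
    """Return human-readable problems; an empty list means the summary passes."""
    problems = [f"missing heading: {h}" for h in REQUIRED_HEADINGS if h not in markdown]
    problems += [f"placeholder present: {p}" for p in PLACEHOLDERS if p in markdown]
    return problems
```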
Every run lives entirely inside its own directory.
```
runs/<run_id>/
├── user_input.txt
├── memory.md
├── run_config.json
├── logs.txt
├── logs_raw.jsonl
├── prompt_cache/
├── operator_state/
├── stages/
└── workspace/
    ├── literature/
    ├── code/
    ├── data/
    ├── results/
    ├── writing/
    ├── figures/
    ├── artifacts/
    ├── notes/
    └── reviews/
```
- `literature/`: reading notes, survey tables, benchmark notes
- `code/`: runnable code, scripts, configs, implementations
- `data/`: machine-readable data and manifests
- `results/`: machine-readable experiment outputs
- `writing/`: LaTeX sources, sections, bibliography, tables
- `figures/`: real plots and paper figures
- `artifacts/`: compiled PDFs and packaged deliverables
- `notes/`: temporary or supporting research notes
- `reviews/`: readiness, critique, and dissemination materials
For each stage attempt, AutoR assembles a prompt from:
- the stage template from src/prompts/
- the required stage summary contract
- execution-discipline constraints
- `user_input.txt`
- approved `memory.md`
- optional refinement feedback
- for continuation attempts, the current draft/final stage files and workspace context
The assembled prompt is written to runs/<run_id>/prompt_cache/, per-stage session IDs are stored in runs/<run_id>/operator_state/, and Claude is invoked in live streaming mode.
Exact Claude CLI pattern
First attempt for a stage:

```bash
claude --model <model> \
  --permission-mode bypassPermissions \
  --dangerously-skip-permissions \
  --session-id <stage_session_id> \
  -p @runs/<run_id>/prompt_cache/<stage>_attempt_<nn>.prompt.md \
  --output-format stream-json \
  --verbose
```

Continuation attempt for the same stage:

```bash
claude --model <model> \
  --permission-mode bypassPermissions \
  --dangerously-skip-permissions \
  --resume <stage_session_id> \
  -p @runs/<run_id>/prompt_cache/<stage>_attempt_<nn>.prompt.md \
  --output-format stream-json \
  --verbose
```

Important behavior:
- refinement attempts reuse the same stage conversation whenever possible
- streamed Claude output is shown live in the terminal
- raw stream-json output is captured in `logs_raw.jsonl`
- if resume fails, AutoR can fall back to a fresh session
- if stage markdown is incomplete, AutoR can repair or normalize it locally
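The first-attempt and continuation command patterns above differ only in `--session-id` versus `--resume`. A sketch of how that argument list could be built, assuming a hypothetical `stage_command` helper (the flags themselves are copied from the patterns above):

```python
# Hypothetical sketch of building the claude CLI argument list shown above.
# stage_command is illustrative; the flags mirror the documented patterns.

def stage_command(model: str, prompt_path: str, session_id: str,
                  is_continuation: bool) -> list[str]:
    """Build the argument list for a first or continuation stage attempt."""
    cmd = ["claude", "--model", model,
           "--permission-mode", "bypassPermissions",
           "--dangerously-skip-permissions"]
    # Continuation attempts reuse the stage conversation via --resume;
    # first attempts pin a fresh conversation via --session-id.
    cmd += ["--resume", session_id] if is_continuation else ["--session-id", session_id]
    cmd += ["-p", f"@{prompt_path}",
            "--output-format", "stream-json", "--verbose"]
    return cmd
```

The list would then be handed to `subprocess.Popen` so the stream-json output can be shown live and mirrored into `logs_raw.jsonl`.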
The main code lives in:
```mermaid
flowchart LR
    A[main.py] --> B[src/manager.py]
    B --> C[src/operator.py]
    B --> D[src/utils.py]
    B --> E[src/writing_manifest.py]
    B --> F[src/prompts/*]
    C --> D
```
- main.py: CLI entry point; starts new runs or resumes old ones
- src/manager.py: owns the 8-stage loop, approval flow, repair flow, and stage continuation policy
- src/operator.py: invokes Claude CLI, streams output, persists session IDs, resumes stage conversations, and falls back on resume failure
- src/utils.py: stage metadata, run paths, prompt assembly, markdown validation, artifact validation, and venue resolution
- src/writing_manifest.py: scans figures, results, data files, and stage summaries to generate Stage 07 writing context
- src/prompts/: one prompt template per stage
- fixed 8-stage workflow
- mandatory human approval after every stage
- one primary Claude invocation per stage attempt
- stage-local continuation within the same Claude session
- prompt caching via `@file`
- live streaming terminal output
- repair passes and local fallback normalization
- draft-to-final stage promotion
- artifact-aware validation
- resume and `--redo-stage`
- lightweight venue profiles for Stage 07 writing
- generic multi-agent orchestration
- database-backed runtime state
- concurrent stage execution
- heavyweight platform abstractions
- dashboard-first productization
The most valuable next steps are the ones that make AutoR more like a real research workflow, not more like a demo framework.
- **Cross-stage rollback and invalidation**: later-stage failures should be able to mark downstream work as stale.
- **Machine-readable run manifest**: add a lightweight source of truth for stage status, stale dependencies, and artifact pointers.
- **Continuation handoff compression**: make long stage refinement more stable without bloating context.
- **Stronger automated tests**: cover repair flow, resume fallback, artifact validation, and approval-loop correctness.
- **Artifact indexing**: add lightweight metadata around `data/`, `results/`, `figures/`, and `writing/`.
- **Frontend run browser**: a lightweight UI for browsing runs, stages, logs, and artifacts, driven by the run directory itself.
- `runs/` is gitignored.
- AutoR controls workflow orchestration, not scientific truth.
- Submission-grade output still depends on the environment, model quality, local tools, and available datasets.
- Stage 07 venue support is intentionally lightweight metadata-driven packaging, not a promise of full official template compliance for every venue.
Join the project community on Discord.









