Autonomous research execution, not just research generation.
Governed, checkpointed, inspectable research runs from brief to manuscript.
English · 한국어 · 日本語 · 简体中文 · 繁體中文 · Español · Français · Deutsch · Português · Русский
Localized README files are maintained translations of this document. For normative wording and latest edits, use the English README as the canonical reference.
AutoLabOS is a governed operating system for research execution. It treats a run as checkpointed research state rather than a one-shot generation step.
The core loop is inspectable end to end: literature collection, hypothesis formation, experiment design, execution, analysis, figure audit, review, and manuscript drafting all produce auditable artifacts. Claims stay evidence-bounded through a claim ceiling. Review is a structural gate, not a polish pass.
Quality assumptions are turned into explicit checks. Real behavior matters more than prompt-level appearance. Reproducibility is enforced through artifacts, checkpoints, and inspectable transitions.
Most research-agent systems are optimized around producing text. AutoLabOS is optimized around running a governed research process.
That difference matters when a project needs more than a plausible-looking draft:
- a research brief that acts as an execution contract
- explicit workflow gates instead of open-ended agent drift
- checkpoints and artifacts that can be inspected after the fact
- review that can stop weak work before manuscript generation
- failure memory so the same failed experiment is not repeated blindly
- evidence-bounded claims rather than prose that outruns the data
AutoLabOS is for teams that want autonomous help without giving up auditability, backtracking, or validation.
One governed run follows the same research arc every time:
Brief.md → literature → hypothesis → experiment design → implementation → execution → analysis → figure audit → review → manuscript
In practice:
- `/new` creates or opens a research brief.
- `/brief start --latest` validates the brief, snapshots it into the run, and launches a governed run.
- The system moves through the fixed research workflow, checkpointing state and artifacts at each boundary.
- Weak evidence triggers backtracking or downgrade instead of automatic polishing.
- If the review gate passes, `write_paper` drafts a manuscript from bounded evidence.
The historical 9-node contract remains the architectural baseline. In the current runtime, figure_audit is the one approved post-analysis checkpoint inserted between analyze_results and review so figure-quality critique can checkpoint and resume independently.
```mermaid
stateDiagram-v2
    [*] --> collect_papers
    collect_papers --> analyze_papers: complete
    analyze_papers --> generate_hypotheses: complete
    generate_hypotheses --> design_experiments: complete
    design_experiments --> implement_experiments: complete
    implement_experiments --> run_experiments: auto_handoff or complete
    run_experiments --> analyze_results: complete
    analyze_results --> figure_audit: auto_advance
    analyze_results --> implement_experiments: auto_backtrack_to_implement
    analyze_results --> design_experiments: auto_backtrack_to_design
    analyze_results --> generate_hypotheses: auto_backtrack_to_hypotheses
    figure_audit --> review: auto_advance
    review --> write_paper: auto_advance
    review --> implement_experiments: auto_backtrack_to_implement
    review --> design_experiments: auto_backtrack_to_design
    review --> generate_hypotheses: auto_backtrack_to_hypotheses
    write_paper --> [*]: auto_complete
```
All automation inside that flow is confined to bounded, node-internal loops. The workflow stays governed even in unattended modes.
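As an illustration of the bounding pattern (not AutoLabOS code; names and limits here are hypothetical), a node-internal loop caps its own attempts and then hands control back to the governed workflow instead of looping open-endedly:

```python
def run_bounded(step, max_attempts=2):
    """Run a node-internal automation step under a hard attempt cap.

    Hypothetical sketch: real node loops also checkpoint state; this only
    shows how a bound keeps automation from drifting.
    """
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        ok, result = step(attempts)
        if ok:
            return {"status": "complete", "attempts": attempts, "result": result}
    # The loop never continues past its bound; the workflow decides what happens next.
    return {"status": "needs_backtrack", "attempts": attempts, "result": None}

# Example: a step that only succeeds on the second attempt.
outcome = run_bounded(lambda n: (n == 2, "metrics"))
```

The point is structural: the bound lives inside the node, so unattended modes inherit it for free.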
AutoLabOS does not just emit a PDF. It emits a traceable research state.
| Output | What it contains |
|---|---|
| Literature corpus | Collected papers, BibTeX, extracted evidence store |
| Hypotheses | Literature-grounded hypotheses with skeptical review |
| Experiment plan | Governed design with contract, baseline lock, and consistency checks |
| Executed results | Metrics, objective evaluation, failure memory log |
| Result analysis | Statistical analysis, attempt decisions, transition reasoning |
| Figure audit | Figure lint, caption/reference consistency, optional vision critique summary |
| Review packet | 5-specialist panel scorecard, claim ceiling, pre-draft critique |
| Manuscript | LaTeX draft with evidence links, scientific validation, optional PDF |
| Checkpoints | Full state snapshots at every node boundary — resume anytime |
Everything lives under `.autolabos/runs/<run_id>/`, with public-facing outputs mirrored to `outputs/`.
That is the reproducibility model: artifacts, checkpoints, and inspectable transitions rather than hidden state.
```
# 1. Install and build
npm install
npm run build
npm link

# 2. Move to a research workspace
cd /path/to/your-research-workspace

# 3. Launch one interface
autolabos        # TUI
autolabos web    # Web UI
```
Typical first-use flow:
```
/new
/brief start --latest
/doctor
```
Notes:
- Both UIs guide onboarding if `.autolabos/config.yaml` does not exist yet.
- Do not run AutoLabOS from the repository root itself. Use a workspace such as `test/` or your own research workspace.
- TUI and Web UI share the same runtime, artifacts, and checkpoints.
| Item | When needed | Notes |
|---|---|---|
| `SEMANTIC_SCHOLAR_API_KEY` | Always | Paper discovery and metadata |
| `OPENAI_API_KEY` | When provider is `api` | OpenAI API model execution |
| Codex CLI login | When provider is `codex` | Uses your local Codex session |
The brief is not just a startup note. It is the governed contract for a run.
`/new` creates or opens `Brief.md`. `/brief start --latest` validates it, snapshots it into the run, and starts execution from that snapshot. The run records the brief source path, the snapshot path, and any parsed manuscript format, so the provenance of the run remains inspectable even if the workspace brief changes later.
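For illustration, the recorded provenance might look like the following fragment. This is hypothetical: the actual file location and field names are defined by the runtime, not by this README.

```json
{
  "brief_source_path": "Brief.md",
  "brief_snapshot_path": ".autolabos/runs/<run_id>/brief_snapshot.md",
  "manuscript_format": { "columns": 2, "page_budget": 8 }
}
```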
That makes the brief part of the audit trail, not just part of the prompt.
```
/new
/brief start --latest
```
Briefs are expected to define both research intent and governance constraints: topic, objective metric, baseline or comparator, minimum acceptable evidence, disallowed shortcuts, and the paper ceiling if evidence remains weak.
Brief sections and grading
| Section | Status | Purpose |
|---|---|---|
| `## Topic` | Required | Research question in 1-3 sentences |
| `## Objective Metric` | Required | Primary success metric |
| `## Constraints` | Recommended | Compute budget, dataset limits, reproducibility rules |
| `## Plan` | Recommended | Step-by-step experiment plan |
| `## Target Comparison` | Governance | Proposed method vs. explicit baseline |
| `## Minimum Acceptable Evidence` | Governance | Minimum effect size, fold count, decision boundary |
| `## Disallowed Shortcuts` | Governance | Shortcuts that invalidate results |
| `## Paper Ceiling If Evidence Remains Weak` | Governance | Maximum paper classification if evidence is insufficient |
| `## Manuscript Format` | Optional | Column count, page budget, reference/appendix rules |
| Grade | Meaning | Paper-scale ready? |
|---|---|---|
| `complete` | Core + 4+ governance sections substantive | Yes |
| `partial` | Core complete + 2+ governance sections | Proceed with warnings |
| `minimal` | Only core sections | No |
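A minimal complete-grade brief could look like the following skeleton. The section headings match the table above; all content is invented purely for illustration.

```markdown
## Topic
Does method X improve task Y over baseline Z on dataset D?

## Objective Metric
Macro-F1 on the held-out test split.

## Constraints
Single GPU, at most 4 hours per experiment, fixed seeds.

## Plan
1. Reproduce baseline Z. 2. Apply X. 3. Compare under identical splits.

## Target Comparison
Proposed X vs. baseline Z under the same data and compute budget.

## Minimum Acceptable Evidence
At least +1.0 macro-F1 over baseline across 5 folds.

## Disallowed Shortcuts
No test-set tuning; no cherry-picked seeds.

## Paper Ceiling If Evidence Remains Weak
Workshop-level report of a neutral or negative result.
```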
AutoLabOS has two front ends over the same governed runtime.
| | TUI | Web UI |
|---|---|---|
| Launch | `autolabos` | `autolabos web` |
| Interaction | Slash commands, natural language | Browser dashboard and composer |
| Workflow view | Real-time node progress in terminal | Governed workflow graph with actions |
| Artifacts | CLI inspection | Inline preview for text, images, PDFs |
| Operations surfaces | `/watch`, `/queue`, `/explore`, `/doctor` | Jobs queue, live watch cards, exploration status, diagnostics |
| Best for | Fast iteration and direct control | Visual monitoring and artifact browsing |
The important constraint is that both surfaces see the same checkpoints, the same runs, and the same underlying artifacts.
AutoLabOS is designed around governed execution rather than prompt-only orchestration.
| | Typical research tools | AutoLabOS |
|---|---|---|
| Workflow | Open-ended agent drift | Governed fixed graph with explicit review boundaries |
| State | Ephemeral | Checkpointed, resumable, inspectable |
| Claims | As strong as the model will generate | Bounded by evidence and a claim ceiling |
| Review | Optional cleanup pass | Structural gate that can block writing |
| Failures | Forgotten and retried | Fingerprinted in failure memory |
| Validation | Secondary | First-class surface: /doctor, harnesses, smoke checks, live validation |
| Interfaces | Separate code paths | TUI and Web share one runtime |
This is why the system reads more like research infrastructure than a paper generator.
The workflow is bounded and auditable. Backtracking is part of the contract. Results that do not justify forward motion are sent back to hypotheses, design, or implementation rather than polished into stronger prose.
Every node boundary writes state you can inspect and resume. The unit of progress is not only text output. It is a run with artifacts, transitions, and recoverable state.
Claims are kept under the strongest defensible evidence ceiling. The system records blocked stronger claims and the evidence gaps required to unlock them.
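To make the mechanism concrete, here is a minimal sketch of a claim ceiling. The grades, claim strengths, and function names are invented for illustration; AutoLabOS's actual taxonomy is defined by its run artifacts, not by this example.

```python
# Hypothetical claim ceiling: claims stronger than the evidence grade
# allows are recorded as blocked rather than silently published.
CEILING = {"weak": 1, "moderate": 2, "strong": 3}
STRENGTH = {"exploratory": 1, "comparable": 2, "outperforms": 3}

def apply_ceiling(claims, evidence_grade):
    """Split (claim, strength) pairs into allowed and blocked lists."""
    limit = CEILING[evidence_grade]
    allowed, blocked = [], []
    for claim, strength in claims:
        (allowed if STRENGTH[strength] <= limit else blocked).append(claim)
    return allowed, blocked

allowed, blocked = apply_ceiling(
    [("method runs end to end", "exploratory"),
     ("method outperforms baseline", "outperforms")],
    "moderate",
)
```

Recording the blocked list is what lets the system report which evidence gaps would unlock a stronger claim.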
`review` is not a cosmetic cleanup stage. It is where readiness, methodology sanity, evidence linkage, writing discipline, and reproducibility handoff are checked before manuscript generation.
Failure fingerprints are persisted so structural errors and repeated equivalent failures are not retried blindly.
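A failure fingerprint can be sketched as a stable hash over the normalized failure shape, with retries exhausted after repeated identical hits. This is an illustrative sketch only; the field names and the exact bound are hypothetical.

```python
import hashlib

def fingerprint(error_type, normalized_message, node):
    """Stable short hash of a failure's shape (hypothetical schema)."""
    key = f"{node}|{error_type}|{normalized_message}"
    return hashlib.sha256(key.encode()).hexdigest()[:16]

seen = {}

def should_retry(fp, max_identical=3):
    """Allow retries until the same fingerprint has been seen max_identical times."""
    seen[fp] = seen.get(fp, 0) + 1
    return seen[fp] < max_identical

fp = fingerprint("CUDAOutOfMemory", "oom during backward pass", "run_experiments")
```

Because the fingerprint depends on the failure's shape rather than its full log text, equivalent failures collapse into one entry instead of being retried blindly.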
Reproducibility is enforced through artifacts, checkpoints, and inspectable transitions. Public-facing summaries mirror persisted run outputs rather than inventing a second source of truth.
AutoLabOS treats validation surfaces as first-class.
- `/doctor` checks environment and workspace readiness before a run starts
- harness validation protects workflow, artifact, and governance contracts
- smoke checks exist for targeted diagnostic coverage
- live validation is used when interactive behavior matters
Paper readiness is not a single binary prompt judgment.
- Layer 1: a deterministic minimum gate blocks under-evidenced work with explicit artifact and evidence-integrity checks
- Layer 2: an LLM paper-quality evaluator adds structured critique over methodology, evidence strength, writing structure, claim support, and limitations honesty
- The review packet plus specialist panel determine whether the manuscript path should advance, revise, or backtrack
`paper_readiness.json` can include an `overall_score`. It should be read as a run-quality signal inside the system, not as a universal scientific benchmark. Some advanced evaluation and self-improvement flows use that score to compare runs or candidate prompt mutations.
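As a sketch of that comparative use (the field names follow this README; the run records and values are invented for illustration):

```python
# Hypothetical in-memory view of two runs' paper_readiness artifacts.
runs = [
    {"run_id": "run-a", "paper_readiness": {"overall_score": 0.62}},
    {"run_id": "run-b", "paper_readiness": {"overall_score": 0.71}},
]

def best_run(runs):
    """Pick the run with the highest recorded overall_score."""
    return max(runs, key=lambda r: r["paper_readiness"]["overall_score"])

winner = best_run(runs)
```

The score only ranks runs relative to each other within the system; it says nothing about absolute scientific merit.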
Why the validation model matters
Quality assumptions are turned into explicit checks. Real behavior matters more than prompt-level appearance. The intended result is not "the model wrote something convincing," but "the run can be inspected and defended."
AutoLabOS includes bounded self-improvement paths, but they are governed by validation and rollback rather than blind autonomous rewriting.
`autolabos meta-harness` builds a context directory from recent completed runs and evaluation history under `outputs/meta-harness/<timestamp>/`.
It can include:
- filtered run events
- node artifacts such as `result_analysis.json` or `review/decision.json`
- `paper_readiness.json`
- `outputs/eval-harness/history.jsonl`
- current `node-prompts/` files for the targeted node
The LLM is instructed through `TASK.md` to return only a TARGET_FILE plus a unified diff, and the target is constrained to `node-prompts/`. In apply mode, the candidate must pass `validate:harness`; otherwise the change is rolled back and an audit log is written. `--no-apply` builds context only. `--dry-run` shows the diff without modifying files.
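The apply-or-rollback contract can be sketched as follows. This is a simplified illustration, not the actual implementation: `store` stands in for files under `node-prompts/`, and `validate` stands in for `validate:harness`.

```python
def apply_candidate(store, target, new_text, validate, audit_log):
    """Keep a candidate change only if validation passes; otherwise restore
    the previous version and record the rejection for audit."""
    previous = store.get(target)
    store[target] = new_text
    if validate(store):
        audit_log.append({"target": target, "applied": True})
        return True
    store[target] = previous  # roll back the failed candidate
    audit_log.append({"target": target, "applied": False, "rolled_back": True})
    return False

store = {"node-prompts/review.md": "v1"}
log = []
# A candidate that fails validation is rolled back and logged.
ok = apply_candidate(store, "node-prompts/review.md", "v2",
                     validate=lambda s: False, audit_log=log)
```

The invariant is that a failed candidate leaves both the prompt files and the audit trail in a consistent state.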
`autolabos evolve` runs a bounded mutation-and-evaluation loop over `.codex` and `node-prompts`.
- supports `--max-cycles`, `--target skills|prompts|all`, and `--dry-run`
- reads run fitness from `paper_readiness.overall_score`
- can mutate prompts and skills, run validation, and compare fitness across cycles
- rolls back regressions by restoring `.codex` and `node-prompts` from the last good git tag
This is a self-improvement path, but not an unconstrained repo-wide rewrite path.
AutoLabOS also has built-in harness presets such as base, compact, failure-aware, and review-heavy. These adjust artifact/context policy, failure-memory emphasis, prompt policy, and compression strategy for comparative evaluation paths without changing the governed production workflow.
| Command | Description |
|---|---|
| `/new` | Create or open `Brief.md` |
| `/brief start <path\|--latest>` | Start research from a brief |
| `/runs [query]` | List or search runs |
| `/resume <run>` | Resume a run |
| `/agent run <node> [run]` | Execute from a graph node |
| `/agent status [run]` | Show node statuses |
| `/agent overnight [run]` | Run unattended with conservative bounds |
| `/agent autonomous [run]` | Run open-ended bounded research exploration |
| `/watch` | Live watch view for active runs and background jobs |
| `/explore` | Show exploration-engine status for the active run |
| `/queue` | Show running, waiting, and stalled jobs |
| `/doctor` | Environment and workspace diagnostics |
| `/model` | Switch model and reasoning effort |
Full command list
| Command | Description |
|---|---|
| `/help` | Show command list |
| `/new` | Create or open workspace `Brief.md` |
| `/brief start <path\|--latest>` | Start research from workspace `Brief.md` or a brief path |
| `/doctor` | Environment + workspace diagnostics |
| `/runs [query]` | List or search runs |
| `/run <run>` | Select run |
| `/resume <run>` | Resume run |
| `/agent list` | List graph nodes |
| `/agent run <node> [run]` | Execute from node |
| `/agent status [run]` | Show node statuses |
| `/agent collect [query] [options]` | Collect papers |
| `/agent recollect <n> [run]` | Collect additional papers |
| `/agent focus <node>` | Move focus with safe jump |
| `/agent graph [run]` | Show graph state |
| `/agent resume [run] [checkpoint]` | Resume from checkpoint |
| `/agent retry [node] [run]` | Retry node |
| `/agent jump <node> [run] [--force]` | Jump to a node |
| `/agent overnight [run]` | Overnight autonomy (24h) |
| `/agent autonomous [run]` | Open-ended autonomous research |
| `/model` | Model and reasoning selector |
| `/approve` | Approve paused node |
| `/queue` | Show running / waiting / stalled jobs |
| `/watch` | Live watch view for active runs |
| `/explore` | Show exploration-engine status |
| `/retry` | Retry current node |
| `/settings` | Provider and model settings |
| `/quit` | Exit |
AutoLabOS is a good fit for:
- teams that want autonomous help with a governed workflow
- research engineering work where checkpoints and artifacts matter
- paper-scale or paper-adjacent projects that need evidence discipline
- environments where review, traceability, and resumability matter as much as generation

It is probably not the right fit for:
- users who only want a fast one-shot draft
- workflows that do not need artifact trails or review gates
- projects that want free-form agent behavior more than governed execution
- cases where a simple literature summary tool is enough
```
npm install
npm run build
npm test
npm run test:web
npm run validate:harness
```
Use the smallest honest validation set that covers the change. For interactive defects, tests are not a substitute for re-running the same TUI or Web flow when the environment allows it.
Useful commands:
```
npm run test:watch
npm run test:smoke:natural-collect
npm run test:smoke:natural-collect-execute
npm run test:smoke:all
```
Execution modes
AutoLabOS preserves the governed workflow and safety gates across every mode.
| Mode | Command | Behavior |
|---|---|---|
| Interactive | `autolabos` | Slash-command TUI with explicit approval gates |
| Minimal approval | Config: `approval_mode: minimal` | Auto-approves safe transitions |
| Hybrid approval | Config: `approval_mode: hybrid` | Auto-advances strong low-risk transitions, pauses risky or low-confidence ones |
| Overnight | `/agent overnight [run]` | Unattended single pass, 24-hour limit, conservative backtracking |
| Autonomous | `/agent autonomous [run]` | Open-ended bounded research exploration |
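The config-driven approval modes are set in the workspace config file. A minimal sketch follows; the `approval_mode` key and its values come from the table above, while the file path matches the onboarding note earlier. Any other keys in the real file are not shown here.

```yaml
# .autolabos/config.yaml (fragment)
approval_mode: hybrid   # or: minimal; omit for fully interactive approval
```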
Governance artifact flow
```mermaid
flowchart LR
    Brief["Research Brief<br/>completeness artifact"] --> Design["design_experiments"]
    Design --> Contract["Experiment Contract<br/>hypothesis, single change,<br/>confound check"]
    Design --> Consistency["Brief-Design Consistency<br/>warnings artifact"]
    Contract --> Run["run_experiments"]
    Run --> Failures["Failure Memory<br/>fingerprinted JSONL"]
    Run --> Analyze["analyze_results"]
    Analyze --> Decision["Attempt Decision<br/>keep/discard/replicate"]
    Decision --> FigureAudit["figure_audit"]
    FigureAudit --> Review["review"]
    Failures --> Review
    Contract --> Review
    Review --> Ceiling["Pre-Review Summary<br/>claim ceiling detail"]
    Ceiling --> Paper["write_paper"]
```
Artifact flow
```mermaid
flowchart TB
    A["collect_papers"] --> A1["corpus.jsonl, bibtex.bib"]
    A1 --> B["analyze_papers"]
    B --> B1["paper_summaries.jsonl, evidence_store.jsonl"]
    B1 --> C["generate_hypotheses"]
    C --> C1["hypotheses.jsonl"]
    C1 --> D["design_experiments"]
    D --> D1["experiment_plan.yaml, experiment_contract.json,<br/>brief_design_consistency.json"]
    D1 --> E["implement_experiments"]
    E --> F["run_experiments"]
    F --> F1["metrics.json, failure_memory.jsonl,<br/>objective_evaluation.json"]
    F1 --> G["analyze_results"]
    G --> G1["result_analysis.json, attempt_decisions.jsonl,<br/>transition_recommendation.json"]
    G1 --> H["figure_audit"]
    H --> H1["gate1_gate2_issues.json,<br/>figure_audit_summary.json"]
    H1 --> I["review"]
    I --> I1["pre_review_summary.json, review_packet.json,<br/>minimum_gate.json, paper_critique.json"]
    I1 --> J["write_paper"]
    J --> J1["main.tex, references.bib,<br/>scientific_validation.json, main.pdf"]
```
Node architecture
| Node | Role(s) | What it does |
|---|---|---|
| `collect_papers` | collector, curator | Discovers and curates candidate paper set via Semantic Scholar |
| `analyze_papers` | reader, evidence extractor | Extracts summaries and evidence from selected papers |
| `generate_hypotheses` | hypothesis agent + skeptical reviewer | Synthesizes ideas from literature, then pressure-tests them |
| `design_experiments` | designer + feasibility/statistical/ops panel | Filters plans for practicality, writes experiment contract |
| `implement_experiments` | implementer | Produces code and workspace changes through ACI actions |
| `run_experiments` | runner + failure triager + rerun planner | Drives execution, records failures, decides reruns |
| `analyze_results` | analyst + metric auditor + confounder detector | Checks result reliability, writes attempt decisions |
| `figure_audit` | figure auditor + optional vision critique | Checks evidence alignment, captions/references, and publication readiness before review |
| `review` | 5-specialist panel + claim ceiling + two-layer gate | Structural review that blocks writing if evidence is insufficient |
| `write_paper` | paper writer + reviewer critique | Drafts manuscript, runs post-draft critique, builds PDF |
Bounded automation
| Node | Internal automation | Bound |
|---|---|---|
| `analyze_papers` | Auto-expands evidence window when too sparse | <= 2 expansions |
| `design_experiments` | Deterministic panel scoring + experiment contract | Runs once per design |
| `run_experiments` | Failure triage + one-shot transient rerun | Never retries structural failures |
| `run_experiments` | Failure memory fingerprinting | >= 3 identical failures exhausts retries |
| `analyze_results` | Objective rematching + result panel calibration | One rematch before human pause |
| `figure_audit` | Gate 3 figure critique + summary aggregation | Vision critique remains independently resumable |
| `write_paper` | Related-work scout + validation-aware repair | 1 repair pass max |
Public output bundle
```
outputs/<title-slug>-<run_id_prefix>/
├── paper/
├── experiment/
├── analysis/
├── review/
├── results/
├── reproduce/
├── manifest.json
└── README.md
```
AutoLabOS is an active OSS research-engineering project. The canonical references for behavior and contracts are the repository docs under docs/, especially:
- `docs/architecture.md`
- `docs/tui-live-validation.md`
- `docs/experiment-quality-bar.md`
- `docs/paper-quality-bar.md`
- `docs/reproducibility.md`
- `docs/research-brief-template.md`
If you are changing runtime behavior, treat those documents, the shipped tests, and the observable artifacts as the source of truth.