Merged
23 changes: 16 additions & 7 deletions dev/ROADMAP.md
@@ -240,18 +240,24 @@ And each skill works standalone: invoke `analyze-paper paper.pdf` for just a str
- [x] Write `references/methodology.md` (proof verification, security analysis, protocol extension, comparison, counterexample patterns)
- [x] Write `references/paper-analysis.md` (extraction guide for crypto/distributed systems/blockchain papers)
- [x] Define the workspace file contract between skills (the table above, formalized)
- [x] Build `/clarify-goal` skill (Stage 0); test independently
- [x] Build `/analyze-paper` skill; test independently
- [x] Build `/review-literature` skill (WebSearch only for now); test independently
- [x] Build `/formalize-problem` skill; test that it produces trust assumptions + security properties + performance goals
- [x] Build `/brainstorm` skill (Stage 2.5 recurring ideation); test independently
- [x] Build `/investigate` skill with full loop discipline:
  - `notes/results.md` structured log with keep/discard per cycle (Principle 3)
  - `current-understanding.md` that only advances on keep (Principle 4)
  - Never-stop and when-stuck guidance (Principle 5)
  - Simplicity criterion for evaluating cycles (Principle 6)
- [x] Build `/synthesize` skill; test independently
- [x] Build the `/reaper` orchestrator that composes them with subagent parallelism
- [x] Create eval framework (`evals/evals.json`) with test cases and quality criteria
- [x] Create test paper specifications (`dev/test-papers/README.md`)
- [x] Create eval framework with test cases and quality criteria — ships as a layered system (`evals/`):
  - **L1 structural**: deterministic Python graders (`evals/graders/structural.py`, `consistency.py`) for required sections, min lengths, broken refs, keep-or-discard cycle invariant. Runs in CI on every PR.
  - **L2 skill rubric**: per-skill rubrics (`evals/rubrics/<skill>.yaml`) graded by `claude -p` (`evals/judge/`) with JSON-schema-pinned output. Subscription tokens, run locally / nightly.
  - **L3 end-to-end**: the 3 cases in `evals/evals.json`, full `/reaper` pipeline. Pre-release.
  - Orchestrator: `python3 -m evals.run_evals --layer {structural,judge,all}`. Fixtures live at `evals/fixtures/<skill>/<case>/` with reference + planted-negative variants per layer.
- [x] Create test paper specifications (`dev/test-papers/README.md`) — specs only; actual PDFs not yet sourced
- [x] Tune skill descriptions for reliable triggering (added action verbs, specific outputs, broader trigger phrases)
- [ ] Upgrade `/synthesize` to produce a compilable LaTeX project instead of markdown:
  - [ ] Change output from `report.md` to `report/` directory (`main.tex`, `references.bib`, `Makefile`)
@@ -262,11 +268,14 @@ And each skill works standalone: invoke `analyze-paper paper.pdf` for just a str
  - [ ] Update the `/synthesize` SKILL.md template: replace the markdown template with LaTeX equivalents
  - [ ] Ensure the orchestrator and other skills that reference `report.md` (e.g., critique reading the report) are updated to read `report/main.tex`
  - [ ] Test: does `make` in `report/` produce a valid PDF without manual fixes?
- [ ] Test full pipeline end-to-end with 3 real papers:
- [ ] Source actual paper PDFs for the 3 specs in `dev/test-papers/README.md` (currently specs-only)
- [ ] Expand L1+L2 eval coverage beyond `analyze-paper` — add fixtures (reference + ≥1 planted negative per layer) and rubrics for the remaining skills (`review-literature`, `formalize-problem`, `brainstorm`, `investigate`, `critique`, `synthesize`)
- [ ] Calibrate judge prompts against `evals/golden/` (≥80% agreement with hand grades) before relying on new dimensions
- [ ] Test full pipeline end-to-end with the 3 papers once sourced:
  - A cryptographic construction with a known proof gap
  - A consensus protocol paper (e.g., compare HotStuff variants)
  - A blockchain paper with questionable security claims
- [ ] Iterate on skills based on test results (using eval framework)
- [ ] Iterate on skills based on test results (using the layered eval framework)
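
To make the L1 structural layer concrete, here is a minimal sketch of two of the checks it names: required sections and the keep-or-discard cycle invariant. The log layout (a `## Cycle N` heading followed by a `Decision: keep` or `Decision: discard` line) and both function names are assumptions made for illustration; the roadmap does not show the real format of `notes/results.md` or the contents of `evals/graders/structural.py`.

```python
import re

# Hypothetical log layout: each cycle starts with "## Cycle N" and must
# contain exactly one "Decision: keep" or "Decision: discard" line.
CYCLE_HEADING = re.compile(r"^## Cycle (\d+)[ \t]*$", re.MULTILINE)
DECISION = re.compile(r"^Decision:[ \t]*(keep|discard)[ \t]*$", re.MULTILINE)
ANY_HEADING = re.compile(r"^#+[ \t]+(.*)$", re.MULTILINE)


def check_required_sections(text, required):
    """Return the required heading titles that do not appear in the text."""
    headings = {m.group(1).strip() for m in ANY_HEADING.finditer(text)}
    return [name for name in required if name not in headings]


def check_cycle_invariant(text):
    """Return the cycle numbers that lack exactly one keep/discard decision."""
    cycles = list(CYCLE_HEADING.finditer(text))
    bad = []
    for i, match in enumerate(cycles):
        # A cycle's body runs until the next cycle heading or end of file.
        end = cycles[i + 1].start() if i + 1 < len(cycles) else len(text)
        body = text[match.end():end]
        if len(DECISION.findall(body)) != 1:
            bad.append(int(match.group(1)))
    return bad
```

Both checks are deterministic and return a list of violations rather than raising, so a CI runner could collect every failure in a fixture before reporting.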

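The judge-calibration item above sets a threshold of at least 80% agreement with hand grades. A minimal sketch of that check follows, assuming grades are available as dictionaries keyed by case ID; the layout of `evals/golden/` is not shown in the roadmap, and `agreement_rate` and `is_calibrated` are hypothetical names.

```python
def agreement_rate(judge_grades, hand_grades):
    """Fraction of shared cases where the judge grade equals the hand grade."""
    shared = sorted(set(judge_grades) & set(hand_grades))
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(1 for case in shared if judge_grades[case] == hand_grades[case])
    return matches / len(shared)


def is_calibrated(judge_grades, hand_grades, threshold=0.80):
    """True when agreement meets the threshold (0.80 mirrors the roadmap's 80%)."""
    return agreement_rate(judge_grades, hand_grades) >= threshold
```

Raw agreement is the simplest reading of the 80% target. A chance-corrected statistic such as Cohen's kappa would be stricter when one grade dominates the golden set.
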
### Horizon 2: The Library

@@ -407,7 +416,7 @@ Different models have different strengths. The critique skill should route consu

**Methodology stage:** All stages — makes the entire pipeline portable across AI agent platforms.

**Current state:** Reaper ships as standard `SKILL.md` folders compatible with the [`vercel-labs/skills`](https://github.com/vercel-labs/skills) convention. A single `npx skills add SebastianElvis/reaper` shallow-clones the repo and copies all 11 skill directories into the host agent's conventional skills folder, supporting 45+ targets including Cursor, OpenAI Codex CLI, Cline, Continue, Gemini CLI, Copilot, Windsurf, OpenCode, Warp, Goose, Replit, and Claude Code. The orchestrator and inter-skill triggers use host-agnostic phrasing ("invoke the `<name>` skill") so that auto-discovery agents and slash-command agents both route correctly.
**Current state:** Reaper ships as standard `SKILL.md` folders compatible with the [`vercel-labs/skills`](https://github.com/vercel-labs/skills) convention. A single `npx skills add SebastianElvis/reaper` shallow-clones the repo and copies all 10 skill directories into the host agent's conventional skills folder, supporting 45+ targets including Cursor, OpenAI Codex CLI, Cline, Continue, Gemini CLI, Copilot, Windsurf, OpenCode, Warp, Goose, Replit, and Claude Code. The orchestrator and inter-skill triggers use host-agnostic phrasing ("invoke the `<name>` skill") so that auto-discovery agents and slash-command agents both route correctly.

**Goal:** Make Reaper a first-class skills package — authored once, runnable on any host that consumes `SKILL.md` files. Achieved by converging on the `SKILL.md` convention rather than building per-host adapters.

@@ -652,15 +661,15 @@ The AI *is* the research agent. No wrapper needed.

### Why a Pipeline of Skills

A monolithic "do research" skill is hard to test, hard to improve, and hard to reuse partially. By decomposing into `/analyze-paper`, `/review-literature`, `/formalize-problem`, `/investigate`, `cross-verify`, and `/synthesize`, each skill:
A monolithic "do research" skill is hard to test, hard to improve, and hard to reuse partially. By decomposing into `/clarify-goal`, `/analyze-paper`, `/review-literature`, `/formalize-problem`, `/brainstorm`, `/investigate`, `/critique`, and `/synthesize`, each skill:
- Can be tested and iterated independently
- Can be used standalone (e.g., just analyze a paper without running the full pipeline)
- Has a clear input/output contract via workspace files
- Can be parallelized (independent skills run as concurrent subagents)
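
The last two points can be sketched together: each skill declares the workspace files it reads, and skills whose inputs are already present can run concurrently. This is an illustrative sketch only. `run_skill` is a stand-in for a real subagent call, the file names are hypothetical, and the assumption that `analyze-paper` and `review-literature` are independent of each other is made for the example rather than taken from the workspace contract.

```python
import asyncio


async def run_skill(name, inputs, workspace):
    """Stand-in for invoking a skill; records the output file it would write."""
    missing = [f for f in inputs if f not in workspace]
    if missing:
        raise RuntimeError(f"{name} is missing inputs: {missing}")
    await asyncio.sleep(0)  # yield control, as a real subagent call would
    output = f"{name}.md"  # hypothetical output file name
    workspace[output] = f"output of {name}"
    return output


async def run_pipeline():
    workspace = {"goal.md": "research goal"}  # hypothetical starting file
    # Assumed independent: both read only goal.md, so they run together.
    first = await asyncio.gather(
        run_skill("analyze-paper", ["goal.md"], workspace),
        run_skill("review-literature", ["goal.md"], workspace),
    )
    # Assumed dependent: reads both outputs above, so it runs afterwards.
    last = await run_skill("formalize-problem", list(first), workspace)
    return first, last
```

Calling `asyncio.run(run_pipeline())` runs the two assumed-independent skills in one `gather` call and the dependent skill after both finish. The input check in `run_skill` is what turns the file contract into something enforceable.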

### Why Multi-Model Feedback

No single model is best at everything. Claude is strong at structured reasoning and long-context analysis. Other models may catch different classes of errors, suggest alternative approaches, or have different training data. The cross-verify step treats other models as peer reviewers — the same role human collaborators play in real research. This is especially important because, unlike autoresearch's val_bpb, theoretical research lacks an objective evaluation oracle.
No single model is best at everything. Claude is strong at structured reasoning and long-context analysis. Other models may catch different classes of errors, suggest alternative approaches, or have different training data. The `/critique` step treats other models as peer reviewers — the same role human collaborators play in real research. This is especially important because, unlike autoresearch's val_bpb, theoretical research lacks an objective evaluation oracle.

### Why File-Based State
