SebastianElvis · May 2, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -20,6 +20,22 @@ jobs:
       - name: Run structure tests
         run: pytest tests/test_skills_structure.py -v
 
+  # L1 structural eval tests — required. Drive the code-based graders in
+  # evals/graders/ against committed fixtures. Cheap, deterministic, no
+  # network or LLM calls. The matching L2 LLM-judge runs are invoked
+  # locally / on demand (see evals/README.md), not in CI.
+  evals-fast:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-python@v5
+        with:
+          python-version: "3.11"
+      - name: Install Python dependencies
+        run: pip install pytest pyyaml
+      - name: Run L1 eval tests
+        run: pytest tests/test_skill_outputs.py -v
+
   # Network-dependent integration tests against live arXiv / IACR ePrint APIs.
   # Non-blocking: external services can rate-limit, return 5xx, or change
   # their HTML — none of which means the package is broken. We still run

diff --git a/.gitignore b/.gitignore
@@ -4,3 +4,5 @@ __pycache__/
 *.pyc
 .context/
 reaper-workspace/
+evals/runs/
+evals/reports/
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -9,20 +9,24 @@ AI-native scientific research pipeline distributed as a host-agnostic skills pac
   - `/clarify-goal` — Interactive goal clarification (asks user targeted questions before pipeline runs)
   - `/analyze-paper`, `/review-literature`, `/formalize-problem`, `/brainstorm`, `/investigate`, `/critique`, `/synthesize` — Pipeline stages
   - `/search-paper` — Academic search + citation graph + venue resolution. Bundles five Python drivers (`arxiv.py`, `iacr.py`, `semantic_scholar.py`, `dblp.py`, `openalex.py`); the `SKILL.md` itself orchestrates the layered venue lookup.
-- `tests/` — Python tests for skill structure and search scripts
-- `evals/` — Test cases with quality criteria (`evals.json`)
+- `tests/` — Python tests for skill structure, search scripts, and L1 eval graders
+- `evals/` — Layered evaluation system. L1 code-based graders (`graders/`), L2 Claude-CLI LLM judges (`judge/`), per-skill rubrics (`rubrics/`), and fixtures with reference + planted-negative variants. Orchestrator: `python3 -m evals.run_evals`. See `evals/README.md`.
 - `dev/` — Development docs including `ROADMAP.md` (full methodology and design)
 - `.claude-plugin/` — Claude-Code-specific plugin manifest (`plugin.json`, `marketplace.json`); other hosts ignore this directory
 - `.github/workflows/` — CI (pytest + strict `npx skills` discovery check that asserts every expected skill, script, and reference file is present after installation)
 
 ## Commands
 
 ```bash
-# Run tests
+# Run tests (includes L1 structural eval graders)
 pytest tests/
 
-# Python dependencies for search skills
-pip install arxiv requests beautifulsoup4
+# Run the layered evals
+python3 -m evals.run_evals --layer structural                  # L1 only — no LLM, what CI runs
+python3 -m evals.run_evals --layer all --skill analyze-paper   # L1 + L2 (uses local `claude` CLI)
+
+# Python dependencies for search skills + evals
+pip install arxiv requests beautifulsoup4 pyyaml
 ```
 
 ## Key conventions
@@ -43,6 +47,7 @@ pip install arxiv requests beautifulsoup4
 - When cutting a release tag, the tag message should summarize changes since the last tag (use `git log <last-tag>..HEAD`).
 - Always use squash merge for PRs.
 - Before finishing a task, check if important docs (README.md, CLAUDE.md, dev/ROADMAP.md) need to be updated to reflect your changes.
+- Eval discipline: skill changes that affect a graded artifact (sections, output shape, quality criteria) must keep the corresponding rule in `evals/run_evals.py::SKILL_STRUCTURAL_RULES` and the rubric under `evals/rubrics/<skill>.yaml` in sync. Add fixtures (one reference + at least one planted negative per layer) before claiming coverage for a new skill. Calibrate new judge dimensions against `evals/golden/` before relying on them. Eval design and authoring follow Anthropic's [*Demystifying Evals for AI Agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) — code-based vs model-based vs human grader split, per-dimension scoring with an "unknown" escape hatch, isolated trials, two-sided cases (both planted negatives and references), and `pass^k` for consistency. Read it before adding a new layer or rubric.
 
 ## Distribution
 

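Reviewer note on the `pass^k` metric named in the eval-discipline bullet: a minimal sketch of how it can be estimated from repeated trials. It assumes trials are independent and uses the observed per-trial pass rate as the estimate. The function name `pass_hat_k` is hypothetical and does not come from `evals/`.

```python
def pass_hat_k(outcomes: list[bool], k: int) -> float:
    """Estimate pass^k: the chance that k independent trials ALL pass.

    `outcomes` holds one boolean per observed trial (True = passed).
    Assumption: trials are independent, so pass^k is approximated by p ** k,
    where p is the observed per-trial pass rate.
    """
    if not outcomes:
        raise ValueError("need at least one trial outcome")
    if k < 1:
        raise ValueError("k must be >= 1")
    p = sum(outcomes) / len(outcomes)
    return p ** k


# Illustrative data only: 3 of 4 trials passed.
trials = [True, True, True, False]
print(pass_hat_k(trials, 1))  # per-trial pass rate
print(pass_hat_k(trials, 3))  # chance that three trials in a row all pass
```

Because the estimate shrinks geometrically with `k`, a skill that passes most single trials can still score poorly on consistency, which is the point of tracking it separately from a single-trial pass rate.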
diff --git a/README.md b/README.md
@@ -185,6 +185,26 @@ reaper-workspace/
 
 The workspace contract is host-agnostic — any agent that can read and write files in the working directory produces the same workspace structure.
 
+## Evaluation
+
+Skills ship with a layered evaluation system following Anthropic's [*Demystifying Evals for AI Agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) methodology. The judge is the local `claude` CLI — no API key, just your existing subscription.
+
+| Layer | Grader | Cadence | Scope |
+|---|---|---|---|
+| L1 Structural | Code (`evals/graders/`) | Every PR (CI) | Required sections, lengths, broken refs, keep-or-discard cycle invariant |
+| L2 Skill rubric | `claude -p` with structured-output JSON schema (`evals/judge/`) | Locally / nightly | Per-skill quality dimensions: groundedness, specificity, completeness |
+| L3 End-to-end | Both | Pre-release | Full `/reaper` pipeline against canonical cases |
+
+```bash
+# L1 only (no LLM) — same thing CI runs
+python3 -m evals.run_evals --layer structural
+
+# L1 + L2 (uses your local claude CLI)
+python3 -m evals.run_evals --layer all --skill analyze-paper
+```
+
+Each fixture pairs a gold-standard reference with planted negatives: one targeting L1 (drops a required section) and one targeting L2 (fabricated theorem statements, generic content). A grader that lets a planted negative through therefore fails the eval run as visibly as a missed regression. Only the L1 check runs in CI; the L2 check surfaces in local or nightly judge runs. See [`evals/README.md`](evals/README.md) for the full design and how to add a fixture.
+
 ## Methodology
 
 Reaper's research loop follows six principles:
@@ -202,7 +222,7 @@ See [`dev/ROADMAP.md`](dev/ROADMAP.md) for the full methodology and development
 
 See [`dev/ROADMAP.md`](dev/ROADMAP.md) for the full roadmap.
 
-- **Horizon 1 (The Pipeline)**: Core skills, orchestrator, and eval framework — *complete; LaTeX report output planned*
+- **Horizon 1 (The Pipeline)**: Core skills, orchestrator, and layered eval system (L1 structural graders + L2 Claude-CLI judges with rubrics, calibrated against planted negatives) — *complete; LaTeX report output and broader rubric coverage across all skills planned*
 - **Horizon 2 (The Library)**: arXiv/ePrint search via Python scripts + citation graph + venue resolution (Semantic Scholar / DBLP / OpenAlex) — *complete*
 - **Horizon 3 (The Committee)**: Multi-model critique via the `/critique` skill's `--codex` mode — *Codex complete, Gemini/DeepSeek/local planned*
 - **Horizon 3.5 (The Polyglot)**: Cross-agent distribution via `npx skills` and host-agnostic skill prose — *complete; per-host orchestration polish ongoing*

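Reviewer note on the two-sided fixtures described in the README's Evaluation section: the sketch below shows one way the "reference must pass, planted negative must fail its target layer" rule could be checked. The function `variant_ok` and the shape of `results` are hypothetical; the layer names `structural` and `judge` are borrowed from the `--layer` values shown in the diff, and the real orchestrator in `evals/run_evals.py` may work differently.

```python
from typing import Optional


def variant_ok(results: dict[str, bool], target_layer: Optional[str]) -> bool:
    """Check one fixture variant against its expected outcome.

    results:      maps layer name -> whether that layer's grader passed.
    target_layer: None for a reference variant; for a planted negative,
                  the layer that is supposed to catch the planted defect.
    """
    if target_layer is None:
        # A reference output must pass every layer that was run.
        return all(results.values())
    # A planted negative must be rejected by the layer it targets.
    # If that layer was not run at all, the expectation is unverified,
    # so this also returns False.
    return results.get(target_layer) is False


# Reference passes both layers: expected behaviour.
print(variant_ok({"structural": True, "judge": True}, None))
# Quality negative slipped past the judge: the grader is too permissive.
print(variant_ok({"structural": True, "judge": True}, "judge"))
# Structural negative rejected by L1: expected behaviour.
print(variant_ok({"structural": False}, "structural"))
```

The sketch deliberately does not require a negative to pass its non-target layers, since the diff does not say whether that is expected.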
diff --git a/evals/README.md b/evals/README.md
@@ -0,0 +1,114 @@
+# Reaper evals
+
+Layered evaluation for the Reaper skills, following [*Demystifying Evals
+for AI Agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents).
+The model-based judge is the local `claude` CLI — no API key required.
+
+## Three layers
+
+| Layer | Grader | Cost | Cadence | What it covers |
+|---|---|---|---|---|
+| **L1 Structural** | Code (`evals/graders/`) | Free | Every PR (CI) | Required sections present, min lengths, no broken refs, keep-or-discard cycle invariant |
+| **L2 Skill rubric** | `claude -p` (`evals/judge/`) | Subscription tokens | Locally / nightly | Per-skill quality dimensions (groundedness, specificity, completeness) |
+| **L3 End-to-end** | Both | Subscription tokens | Pre-release | Full `/reaper` pipeline against the 3 cases in `evals/evals.json` |
+
+## Layout
+
+```
+evals/
+  evals.json                 # case registry (kept for human reference)
+  fixtures/<skill>/<case>/   # one directory per fixture
+    spec.yaml                # variant declarations + expected layer outcomes
+    inputs/                  # what the skill consumes (paper text, etc.)
+    reference/               # gold-standard output (must pass every layer)
+    negative-structural/     # planted L1 violation (drops a section, etc.)
+    negative-quality/        # planted L2 violation (fabricated claim, etc.)
+  rubrics/<skill>.yaml       # which dimensions apply, and their pass thresholds
+  judge/
+    judge.py                 # claude CLI wrapper, JSON-schema enforced
+    schemas/rubric.json      # per-dimension structured-output shape
+    prompts/<dimension>.md   # one judge persona per rubric dimension
+  graders/
+    structural.py            # L1 assertion helpers (pure Python)
+    consistency.py           # cycle invariant verifier
+  run_evals.py               # orchestrator
+  runs/                      # per-trial workspaces (gitignored)
+  reports/                   # md + json reports (gitignored)
+```
+
+## Running
+
+```bash
+# L1 only — same thing CI runs (no claude CLI required)
+python3 -m evals.run_evals --layer structural
+
+# L2 only — judges every variant of every fixture (uses claude CLI)
+python3 -m evals.run_evals --layer judge --skill analyze-paper
+
+# Full run (L1 + L2)
+python3 -m evals.run_evals --layer all --skill analyze-paper
+
+# One variant of one case
+python3 -m evals.run_evals --layer all --skill analyze-paper --variant reference
+```
+
+The orchestrator stages each variant into a fresh `evals/runs/<run-id>/`
+directory before grading — per the eval guide's "isolated environments"
+recommendation. Reports land in `evals/reports/<run-id>.{md,json}`.
+
+The pytest entry point that CI uses lives at `tests/test_skill_outputs.py`
+and exercises the same L1 graders.
+
+## Adding a fixture
+
+1. Create `evals/fixtures/<skill>/<case>/`.
+2. Put what the skill consumes under `inputs/` (paper text, prior notes, etc.).
+3. Hand-write a gold-standard output under `reference/`.
+4. Write at least one **structural negative** (drops a required section, etc.)
+   under `negative-structural/` and one **quality negative** (fabricated
+   claim, generic content) under `negative-quality/`. One-sided evals create
+   one-sided optimization; both directions matter.
+5. Declare the variants in `spec.yaml` (see the existing
+   `cryptography-sample` fixture). Each negative carries a `target_layer`
+   so the orchestrator knows which grader is supposed to fail.
+6. If this is a new skill, add an entry to `SKILL_STRUCTURAL_RULES` in
+   `evals/run_evals.py` and a rubric file under `evals/rubrics/`.
+
+`tests/test_skill_outputs.py::test_every_fixture_skill_has_rules` will fail
+if you add a fixture without graders — coverage without graders is invisible.
+
+## Adding a judge dimension
+
+1. Drop a per-dimension prompt at `evals/judge/prompts/<dim>.md`. Lead with
+   the score scale, require a verbatim `evidence` quote, and include the
+   `"unknown"` escape hatch (the schema enforces these fields, but the
+   prompt has to ask for them clearly).
+2. Add the dimension to the skill's rubric YAML, with a `passing_score`.
+3. Calibrate before relying on it: hand-grade ~10 transcripts, compare to
+   judge verdicts, iterate the prompt until ≥80% agreement. Keep the
+   calibration corpus under `evals/golden/`.
+
+## Calibration
+
+To check whether a judge prompt agrees with expert opinion, run it against
+the gold reference and the planted negative for a fixture:
+
+```bash
+python3 -m evals.run_evals --layer judge --skill analyze-paper --variant reference
+python3 -m evals.run_evals --layer judge --skill analyze-paper --variant quality
+```
+
+Expected: reference passes every dimension; the quality negative fails the
+dimensions it's planted to violate (see `expected_failures.judge` in
+`spec.yaml`). When they don't, fix the prompt, not the fixture.
+
+## Why `claude -p` and not the API?
+
+- No API key in CI or in maintainer envs — uses each maintainer's local
+  `claude` CLI auth (subscription or `claude setup-token`).
+- `--allowedTools ""` makes the judge a pure grader (no tool calls).
+- `--json-schema` pins the output shape — prompt drift can't reshape the
+  result.
+- `--no-session-persistence` + per-trial `--add-dir` keep trials isolated.
+- `--model claude-opus-4-7` is pinned, so judge drift is detectable when
+  the model is bumped.
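Reviewer note on the calibration step in `evals/README.md` ("hand-grade ~10 transcripts, compare to judge verdicts, iterate the prompt until ≥80% agreement"): a minimal sketch of the agreement computation. The function name `agreement_rate`, the label strings, and the sample data are all hypothetical; labels are compared verbatim, so an `"unknown"` verdict only counts as agreement when the human grader also marked the transcript `"unknown"`, which is an assumption rather than something the PR specifies.

```python
def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of transcripts where the judge verdict equals the human label."""
    if len(human) != len(judge):
        raise ValueError("human and judge label lists must be the same length")
    if not human:
        raise ValueError("need at least one graded transcript")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


# Illustrative labels for ten hand-graded transcripts (not real data).
human = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]

rate = agreement_rate(human, judge)
print(f"agreement: {rate:.0%}, meets 80% bar: {rate >= 0.8}")
```

With only about ten transcripts, each disagreement moves the rate by ten percentage points, so a result near the threshold is worth re-checking on a larger corpus before trusting the dimension.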
diff --git a/evals/__init__.py b/evals/__init__.py
diff --git a/evals/fixtures/analyze-paper/cryptography-sample/inputs/paper.txt b/evals/fixtures/analyze-paper/cryptography-sample/inputs/paper.txt
@@ -0,0 +1,59 @@
+A Simplified Threshold Signature Scheme for Asynchronous Networks
+==================================================================
+
+Authors: A. Reviewer, B. Tester
+Venue: Eval Fixture Press, 2026
+arXiv: 2026.99999
+
+Abstract.
+We present a (t, n)-threshold signature scheme that operates correctly in
+the asynchronous network model with up to t < n/3 Byzantine corruptions.
+The scheme uses pairing-based cryptography and assumes the co-CDH problem
+is hard.
+
+1. Introduction.
+Threshold signatures let any subset of t+1 out of n parties jointly sign
+a message such that no coalition of t parties can forge a signature. Our
+contribution is a single-round signing protocol that does not require any
+trusted dealer beyond setup.
+
+2. System Model.
+- Network: asynchronous; messages may be arbitrarily delayed but eventually
+  delivered.
+- Adversary: static, computationally bounded, may corrupt up to t < n/3
+  parties.
+- Trust: a one-time trusted setup distributes verification keys; no further
+  trust assumptions.
+- Communication: authenticated point-to-point channels between every pair.
+- Cryptographic assumption: co-CDH is hard in the bilinear group.
+
+3. Construction.
+Setup: a trusted dealer runs Shamir secret sharing over Z_q to distribute
+shares of the master secret key sk to the n parties.
+
+Signing: each party P_i computes a partial signature sigma_i = H(m)^{sk_i}
+and broadcasts it. Any party that collects t+1 valid partial signatures
+combines them via Lagrange interpolation in the exponent to produce the
+full signature sigma = H(m)^{sk}.
+
+4. Security.
+
+Theorem 4.1 (Unforgeability). If the co-CDH problem is hard in the
+bilinear group, then no PPT adversary corrupting up to t < n/3 parties
+can produce a valid signature on a message it did not request to be
+signed, except with negligible probability.
+
+Proof sketch. We reduce co-CDH to forgery. Given a co-CDH challenge
+(g, g^a, h), the simulator embeds g^a as the public key of an honest
+party and answers signing queries using the standard Boneh-Lynn-Shacham
+trick. Any forgery yields a co-CDH solution.
+
+5. Complexity.
+- Communication: O(n) messages per signature.
+- Rounds: 1 (signing) + 0 (combining is local).
+- Computation: O(t) pairings per verification.
+
+6. Discussion.
+The scheme is round-optimal among non-interactive threshold signatures.
+A potential weakness is that the trusted setup is a single point of
+failure; replacing it with a DKG protocol is left as future work.
diff --git a/evals/fixtures/analyze-paper/cryptography-sample/negative-quality/paper-summary.md b/evals/fixtures/analyze-paper/cryptography-sample/negative-quality/paper-summary.md
@@ -0,0 +1,31 @@
+# Paper Summary: Threshold Signatures
+
+## Metadata
+- **Title**: Threshold Signatures Paper
+- **Authors**: not specified
+- **Venue/Year**: 2026
+- **Paper ID**: not specified
+
+## Problem Statement
+The paper is about threshold signatures.
+
+## Construction Overview
+The paper presents a threshold signature scheme.
+
+## Key Results
+1. **Theorem 4.2 (Strong Unforgeability)**: "Under the DDH assumption, the
+   scheme is strongly unforgeable against adaptive chosen message attacks
+   for any t < n/2."
+   - Model: synchronous, adaptive
+   - Proof technique: simulation
+
+## Strengths
+- The writing is good.
+- The paper has interesting ideas.
+
+## Weaknesses
+- Could be improved.
+- Some parts are unclear.
+
+## Red Flags
+None.
diff --git a/...fixtures/analyze-paper/cryptography-sample/negative-structural/paper-summary.md b/...fixtures/analyze-paper/cryptography-sample/negative-structural/paper-summary.md
@@ -0,0 +1,23 @@
+# Paper Summary: Threshold Signatures
+
+## Metadata
+- **Title**: A Simplified Threshold Signature Scheme for Asynchronous Networks
+- **Authors**: A. Reviewer, B. Tester
+- **Venue/Year**: 2026
+
+## Problem Statement
+Threshold signatures let t+1 of n parties jointly produce a signature that no
+t-coalition can forge. The paper targets a single-round, dealer-free signing
+protocol that operates correctly under asynchrony.
+
+## Key Results
+1. **Theorem 4.1 (Unforgeability)**: "If the co-CDH problem is hard in the
+   bilinear group, then no PPT adversary corrupting up to t < n/3 parties
+   can produce a valid signature on a message it did not request to be
+   signed, except with negligible probability."
+
+## Weaknesses
+- **Major**: trusted setup is a single point of failure.
+
+## Red Flags
+None observed.
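Reviewer note on the `negative-structural` fixture: a minimal sketch of the kind of required-section check an L1 grader could run against it. The real rules live in `SKILL_STRUCTURAL_RULES` and `evals/graders/structural.py`, neither of which is shown in this diff, so the `REQUIRED_SECTIONS` list (inferred from the headings in the `negative-quality` variant) and the `missing_sections` helper are assumptions.

```python
import re

# Assumed required headings, inferred from the negative-quality variant,
# which keeps every section. The real list is not shown in this diff.
REQUIRED_SECTIONS = [
    "Metadata",
    "Problem Statement",
    "Construction Overview",
    "Key Results",
    "Strengths",
    "Weaknesses",
    "Red Flags",
]


def missing_sections(markdown: str, required: list[str] = REQUIRED_SECTIONS) -> list[str]:
    """Return the required level-2 headings absent from `markdown`, in order."""
    found = {
        m.group(1)
        for m in re.finditer(r"^##[ \t]+(.+?)[ \t]*$", markdown, flags=re.MULTILINE)
    }
    return [name for name in required if name not in found]


# Heading skeleton of the negative-structural variant (bodies abbreviated).
sample = """# Paper Summary: Threshold Signatures

## Metadata
- **Venue/Year**: 2026

## Problem Statement
Threshold signatures let t+1 of n parties jointly produce a signature.

## Key Results
1. **Theorem 4.1 (Unforgeability)**

## Weaknesses
- **Major**: trusted setup is a single point of failure.

## Red Flags
None observed.
"""

print(missing_sections(sample))
```

Under the assumed list, this variant lacks `Construction Overview` and `Strengths`, which is the sort of violation the L1 layer is meant to catch without any LLM call.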