Merged
16 changes: 16 additions & 0 deletions .github/workflows/ci.yml
@@ -20,6 +20,22 @@ jobs:
      - name: Run structure tests
        run: pytest tests/test_skills_structure.py -v

# L1 structural eval tests — required. Drive the code-based graders in
# evals/graders/ against committed fixtures. Cheap, deterministic, no
# network or LLM calls. The matching L2 LLM-judge runs are invoked
# locally / on demand (see evals/README.md), not in CI.
  evals-fast:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install Python dependencies
        run: pip install pytest pyyaml
      - name: Run L1 eval tests
        run: pytest tests/test_skill_outputs.py -v

# Network-dependent integration tests against live arXiv / IACR ePrint APIs.
# Non-blocking: external services can rate-limit, return 5xx, or change
# their HTML — none of which means the package is broken. We still run
2 changes: 2 additions & 0 deletions .gitignore
@@ -4,3 +4,5 @@ __pycache__/
*.pyc
.context/
reaper-workspace/
evals/runs/
evals/reports/
15 changes: 10 additions & 5 deletions CLAUDE.md
@@ -9,20 +9,24 @@ AI-native scientific research pipeline distributed as a host-agnostic skills pac
- `/clarify-goal` — Interactive goal clarification (asks user targeted questions before pipeline runs)
- `/analyze-paper`, `/review-literature`, `/formalize-problem`, `/brainstorm`, `/investigate`, `/critique`, `/synthesize` — Pipeline stages
- `/search-paper` — Academic search + citation graph + venue resolution. Bundles five Python drivers (`arxiv.py`, `iacr.py`, `semantic_scholar.py`, `dblp.py`, `openalex.py`); the `SKILL.md` itself orchestrates the layered venue lookup.
- `tests/` — Python tests for skill structure and search scripts
- `evals/` — Test cases with quality criteria (`evals.json`)
- `tests/` — Python tests for skill structure, search scripts, and L1 eval graders
- `evals/` — Layered evaluation system. L1 code-based graders (`graders/`), L2 Claude-CLI LLM judges (`judge/`), per-skill rubrics (`rubrics/`), and fixtures with reference + planted-negative variants. Orchestrator: `python3 -m evals.run_evals`. See `evals/README.md`.
- `dev/` — Development docs including `ROADMAP.md` (full methodology and design)
- `.claude-plugin/` — Claude-Code-specific plugin manifest (`plugin.json`, `marketplace.json`); other hosts ignore this directory
- `.github/workflows/` — CI (pytest + strict `npx skills` discovery check that asserts every expected skill, script, and reference file is present after installation)

## Commands

```bash
# Run tests
# Run tests (includes L1 structural eval graders)
pytest tests/

# Python dependencies for search skills
pip install arxiv requests beautifulsoup4
# Run the layered evals
python3 -m evals.run_evals --layer structural # L1 only — no LLM, what CI runs
python3 -m evals.run_evals --layer all --skill analyze-paper # L1 + L2 (uses local `claude` CLI)

# Python dependencies for search skills + evals
pip install arxiv requests beautifulsoup4 pyyaml
```

## Key conventions
@@ -43,6 +47,7 @@ pip install arxiv requests beautifulsoup4
- When cutting a release tag, the tag message should summarize changes since the last tag (use `git log <last-tag>..HEAD`).
- Always use squash merge for PRs.
- Before finishing a task, check if important docs (README.md, CLAUDE.md, dev/ROADMAP.md) need to be updated to reflect your changes.
- Eval discipline: skill changes that affect a graded artifact (sections, output shape, quality criteria) must keep the corresponding rule in `evals/run_evals.py::SKILL_STRUCTURAL_RULES` and the rubric under `evals/rubrics/<skill>.yaml` in sync. Add fixtures (one reference + at least one planted negative per layer) before claiming coverage for a new skill. Calibrate new judge dimensions against `evals/golden/` before relying on them. Eval design and authoring follow Anthropic's [*Demystifying Evals for AI Agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) — code-based vs model-based vs human grader split, per-dimension scoring with an "unknown" escape hatch, isolated trials, two-sided cases (both planted negatives and references), and `pass^k` for consistency. Read it before adding a new layer or rubric.
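
The `pass^k` consistency metric mentioned above can be sketched as follows. This is a minimal illustration, assuming independent trials and a plug-in estimate of the per-trial pass rate; the function names are hypothetical and are not part of `evals/run_evals.py`.

```python
def pass_hat_k(passes: int, trials: int, k: int) -> float:
    """Estimated probability that k independent trials ALL pass (pass^k).

    Plug-in estimate: (observed pass rate) ** k. Hypothetical helper,
    not the orchestrator's actual implementation.
    """
    if trials <= 0 or not 0 <= passes <= trials or k < 1:
        raise ValueError("need trials > 0, 0 <= passes <= trials, k >= 1")
    return (passes / trials) ** k


def pass_at_k(passes: int, trials: int, k: int) -> float:
    """Estimated probability that AT LEAST ONE of k trials passes (pass@k)."""
    if trials <= 0 or not 0 <= passes <= trials or k < 1:
        raise ValueError("need trials > 0, 0 <= passes <= trials, k >= 1")
    return 1.0 - (1.0 - passes / trials) ** k


# Illustrative numbers only: 9 passing trials out of 10, judged over k = 3.
consistency = pass_hat_k(9, 10, 3)
capability = pass_at_k(9, 10, 3)
print(f"pass^3 ~ {consistency:.3f}, pass@3 ~ {capability:.3f}")
```

`pass^k` shrinks as `k` grows while `pass@k` rises, which is why the former suits consistency checks on a judge or skill that must behave the same way on every run.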

## Distribution

22 changes: 21 additions & 1 deletion README.md
@@ -185,6 +185,26 @@ reaper-workspace/

The workspace contract is host-agnostic — any agent that can read and write files in the working directory produces the same workspace structure.

## Evaluation

Skills ship with a layered evaluation system following Anthropic's [*Demystifying Evals for AI Agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents) methodology. The judge is the local `claude` CLI — no API key, just your existing subscription.

| Layer | Grader | Cadence | Scope |
|---|---|---|---|
| L1 Structural | Code (`evals/graders/`) | Every PR (CI) | Required sections, lengths, broken refs, keep-or-discard cycle invariant |
| L2 Skill rubric | `claude -p` with structured-output JSON schema (`evals/judge/`) | Locally / nightly | Per-skill quality dimensions: groundedness, specificity, completeness |
| L3 End-to-end | Both | Pre-release | Full `/reaper` pipeline against canonical cases |

```bash
# L1 only (no LLM) — same thing CI runs
python3 -m evals.run_evals --layer structural

# L1 + L2 (uses your local claude CLI)
python3 -m evals.run_evals --layer all --skill analyze-paper
```

Each fixture pairs a gold-standard reference with planted negatives — one targeting L1 (drops a required section) and one targeting L2 (fabricated theorem statements, generic content) — so a permissive grader fails CI as visibly as a missed regression. See [`evals/README.md`](evals/README.md) for the full design and how to add a fixture.

## Methodology

Reaper's research loop follows six principles:
@@ -202,7 +222,7 @@ See [`dev/ROADMAP.md`](dev/ROADMAP.md) for the full methodology and development

See [`dev/ROADMAP.md`](dev/ROADMAP.md) for the full roadmap.

- **Horizon 1 (The Pipeline)**: Core skills, orchestrator, and eval framework — *complete; LaTeX report output planned*
- **Horizon 1 (The Pipeline)**: Core skills, orchestrator, and layered eval system (L1 structural graders + L2 Claude-CLI judges with rubrics, calibrated against planted negatives) — *complete; LaTeX report output and broader rubric coverage across all skills planned*
- **Horizon 2 (The Library)**: arXiv/ePrint search via Python scripts + citation graph + venue resolution (Semantic Scholar / DBLP / OpenAlex) — *complete*
- **Horizon 3 (The Committee)**: Multi-model critique via the `/critique` skill's `--codex` mode — *Codex complete, Gemini/DeepSeek/local planned*
- **Horizon 3.5 (The Polyglot)**: Cross-agent distribution via `npx skills` and host-agnostic skill prose — *complete; per-host orchestration polish ongoing*
114 changes: 114 additions & 0 deletions evals/README.md
@@ -0,0 +1,114 @@
# Reaper evals

Layered evaluation for the Reaper skills, following [*Demystifying Evals
for AI Agents*](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents).
The model-based judge is the local `claude` CLI — no API key required.

## Three layers

| Layer | Grader | Cost | Cadence | What it covers |
|---|---|---|---|---|
| **L1 Structural** | Code (`evals/graders/`) | Free | Every PR (CI) | Required sections present, min lengths, no broken refs, keep-or-discard cycle invariant |
| **L2 Skill rubric** | `claude -p` (`evals/judge/`) | Subscription tokens | Locally / nightly | Per-skill quality dimensions (groundedness, specificity, completeness) |
| **L3 End-to-end** | Both | Subscription tokens | Pre-release | Full `/reaper` pipeline against the 3 cases in `evals/evals.json` |

## Layout

```
evals/
  evals.json                   # case registry (kept for human reference)
  fixtures/<skill>/<case>/     # one directory per fixture
    spec.yaml                  # variant declarations + expected layer outcomes
    inputs/                    # what the skill consumes (paper text, etc.)
    reference/                 # gold-standard output (must pass every layer)
    negative-structural/       # planted L1 violation (drops a section, etc.)
    negative-quality/          # planted L2 violation (fabricated claim, etc.)
  rubrics/<skill>.yaml         # which dimensions apply, and their pass thresholds
  judge/
    judge.py                   # claude CLI wrapper, JSON-schema enforced
    schemas/rubric.json        # per-dimension structured-output shape
    prompts/<dimension>.md     # one judge persona per rubric dimension
  graders/
    structural.py              # L1 assertion helpers (pure Python)
    consistency.py             # cycle invariant verifier
  run_evals.py                 # orchestrator
  runs/                        # per-trial workspaces (gitignored)
  reports/                     # md + json reports (gitignored)
```

## Running

```bash
# L1 only — same thing CI runs (no claude CLI required)
python3 -m evals.run_evals --layer structural

# L2 only — judges every variant of every fixture (uses claude CLI)
python3 -m evals.run_evals --layer judge --skill analyze-paper

# Full run (L1 + L2)
python3 -m evals.run_evals --layer all --skill analyze-paper

# One variant of one case
python3 -m evals.run_evals --layer all --skill analyze-paper --variant reference
```

The orchestrator stages each variant into a fresh `evals/runs/<run-id>/`
directory before grading — per the eval guide's "isolated environments"
recommendation. Reports land in `evals/reports/<run-id>.{md,json}`.
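
The staging step described above might look roughly like the sketch below. It assumes a random hex run id and a plain directory copy; `stage_variant` and the `summary.md` file name are hypothetical, and the real orchestrator may stage files differently.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def stage_variant(variant_dir: Path, runs_root: Path) -> Path:
    """Copy one fixture variant into a fresh, uniquely named run directory.

    Hypothetical helper: the run-id scheme (random hex) is an assumption.
    """
    run_dir = runs_root / uuid.uuid4().hex
    # copytree creates run_dir and any missing parents on the way.
    shutil.copytree(variant_dir, run_dir / variant_dir.name)
    return run_dir


# Demo against a throwaway directory so nothing touches the real evals/ tree.
with tempfile.TemporaryDirectory() as tmp:
    variant = Path(tmp) / "reference"
    variant.mkdir()
    (variant / "summary.md").write_text("# Paper Summary\n")  # made-up file

    first = stage_variant(variant, Path(tmp) / "runs")
    second = stage_variant(variant, Path(tmp) / "runs")

    staged_ok = (first / "reference" / "summary.md").is_file()
    isolated = first != second  # each trial gets its own workspace
```

Copying the variant before grading means a grader or judge that writes into its working directory cannot contaminate the committed fixture or a later trial.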

The pytest entry point that CI uses lives at `tests/test_skill_outputs.py`
and exercises the same L1 graders.
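
As a rough picture of what an L1 "required sections present" check does, here is a minimal sketch. The section list is read off the fixture summaries in this PR and is an assumption; the authoritative rules live in `SKILL_STRUCTURAL_RULES`, and `missing_sections` is a hypothetical name, not a function from `evals/graders/structural.py`.

```python
import re

# Assumed section list for analyze-paper, inferred from the fixture outputs.
REQUIRED_SECTIONS = [
    "Metadata",
    "Problem Statement",
    "Construction Overview",
    "Key Results",
    "Strengths",
    "Weaknesses",
    "Red Flags",
]


def missing_sections(markdown: str, required=REQUIRED_SECTIONS) -> list:
    """Return the required level-2 headings that the document lacks."""
    found = {
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+?)\s*$", markdown, re.MULTILINE)
    }
    return [name for name in required if name not in found]


# Headings of a structural negative that drops two sections.
negative = """# Paper Summary: Threshold Signatures

## Metadata
## Problem Statement
## Key Results
## Weaknesses
## Red Flags
"""

print(missing_sections(negative))  # ['Construction Overview', 'Strengths']
```

Because the check is pure string matching, it is deterministic and needs no network or LLM access, which is what makes it suitable as a required CI gate.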

## Adding a fixture

1. Create `evals/fixtures/<skill>/<case>/`.
2. Put what the skill consumes under `inputs/` (paper text, prior notes, etc.).
3. Hand-write a gold-standard output under `reference/`.
4. Write at least one **structural negative** (drops a required section, etc.)
under `negative-structural/` and one **quality negative** (fabricated
claim, generic content) under `negative-quality/`. One-sided evals create
one-sided optimization; both directions matter.
5. Declare the variants in `spec.yaml` (see the existing
`cryptography-sample` fixture). Each negative carries a `target_layer`
so the orchestrator knows which grader is supposed to fail.
6. If this is a new skill, add an entry to `SKILL_STRUCTURAL_RULES` in
`evals/run_evals.py` and a rubric file under `evals/rubrics/`.

`tests/test_skill_outputs.py::test_every_fixture_skill_has_rules` will fail
if you add a fixture without graders — coverage without graders is invisible.
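
The role of `target_layer` can be illustrated with a small two-sided check. This is a sketch under assumptions: layer names follow the `--layer` values (`structural`, `judge`), a reference is modelled as having no target layer, and `outcome_matches` is a hypothetical helper rather than the orchestrator's real logic.

```python
from typing import Dict, Optional


def outcome_matches(target_layer: Optional[str],
                    results: Dict[str, bool]) -> Optional[bool]:
    """Compare layer results (True = passed) with what the variant expects.

    - Reference (target_layer is None): every layer that ran must pass.
    - Negative: the targeted layer must FAIL; a pass means the grader is
      too permissive.
    - Returns None when the targeted layer did not run.
    """
    if target_layer is None:
        return all(results.values())
    if target_layer not in results:
        return None
    return not results[target_layer]


# A reference that passes both layers behaves as expected.
print(outcome_matches(None, {"structural": True, "judge": True}))
# A structural negative that L1 rejects behaves as expected.
print(outcome_matches("structural", {"structural": False}))
# A quality negative that the judge lets through exposes a permissive grader.
print(outcome_matches("judge", {"structural": True, "judge": True}))
```

Treating "negative passed its target layer" as a failure is what makes the suite two-sided: a grader that accepts everything is caught just as reliably as a regression in the skill.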

## Adding a judge dimension

1. Drop a per-dimension prompt at `evals/judge/prompts/<dim>.md`. Lead with
the score scale, require a verbatim `evidence` quote, and include the
`"unknown"` escape hatch (the schema enforces these fields, but the
prompt has to ask for them clearly).
2. Add the dimension to the skill's rubric YAML, with a `passing_score`.
3. Calibrate before relying on it: hand-grade ~10 transcripts, compare to
judge verdicts, iterate the prompt until ≥80% agreement. Keep the
calibration corpus under `evals/golden/`.
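
The ≥80% agreement target in step 3 can be computed with a few lines. This is a minimal sketch using simple percent agreement; the labels below are invented for illustration, and `agreement_rate` is a hypothetical helper.

```python
from typing import Sequence


def agreement_rate(human: Sequence[str], judge: Sequence[str]) -> float:
    """Fraction of transcripts where the judge verdict equals the hand grade."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(h == j for h, j in zip(human, judge))
    return matches / len(human)


# Invented labels for ten hand-graded transcripts (illustration only).
human = ["pass", "fail", "pass", "pass", "fail",
         "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "fail", "pass", "fail", "fail",
         "pass", "fail", "pass", "pass", "pass"]

rate = agreement_rate(human, judge)
verdict = "calibrated" if rate >= 0.8 else "iterate on the prompt"
print(f"{rate:.0%} agreement: {verdict}")  # 80% agreement: calibrated
```

With only about ten transcripts each disagreement moves the rate by ten points, so it is worth reading the mismatched cases individually rather than relying on the number alone.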

## Calibration

To check whether a judge prompt agrees with expert opinion, run it against
the gold reference and the planted negative for a fixture:

```bash
python3 -m evals.run_evals --layer judge --skill analyze-paper --variant reference
python3 -m evals.run_evals --layer judge --skill analyze-paper --variant quality
```

Expected: reference passes every dimension; the quality negative fails the
dimensions it's planted to violate (see `expected_failures.judge` in
`spec.yaml`). When they don't, fix the prompt, not the fixture.

## Why `claude -p` and not the API?

- No API key in CI or in maintainer envs — uses each maintainer's local
`claude` CLI auth (subscription or `claude setup-token`).
- `--allowedTools ""` makes the judge a pure grader (no tool calls).
- `--json-schema` pins the output shape — prompt drift can't reshape the
result.
- `--no-session-persistence` + per-trial `--add-dir` keep trials isolated.
- `--model claude-opus-4-7` is pinned, so judge drift is detectable when
the model is bumped.
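
Put together, the flags listed above suggest a judge invocation along these lines. This is only a sketch: the flag names are taken from the list above, but the argument shapes (in particular, passing the schema inline as a JSON string) are assumptions, and `build_judge_command` is a hypothetical helper, not the code in `evals/judge/judge.py`. The command is built but not executed here.

```python
import json
from typing import List


def build_judge_command(prompt: str, schema: dict, trial_dir: str,
                        model: str = "claude-opus-4-7") -> List[str]:
    """Assemble an argv list for one isolated judge call (not executed)."""
    return [
        "claude", "-p", prompt,
        "--model", model,                    # pinned so judge drift is visible
        "--allowedTools", "",                # pure grader, no tool calls
        "--json-schema", json.dumps(schema),  # assumed: schema passed inline
        "--no-session-persistence",
        "--add-dir", trial_dir,              # per-trial workspace only
    ]


# Made-up minimal schema and paths, for illustration only.
command = build_judge_command(
    prompt="Score the summary for groundedness.",
    schema={"type": "object", "required": ["score", "evidence"]},
    trial_dir="evals/runs/example-run",
)
print(command[:2], len(command))
```

Building the argv as a list and handing it to `subprocess.run(command, capture_output=True, text=True)` avoids shell quoting issues with the prompt and the empty `--allowedTools` value.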
Empty file added evals/__init__.py
59 changes: 59 additions & 0 deletions evals/fixtures/analyze-paper/cryptography-sample/inputs/paper.txt
@@ -0,0 +1,59 @@
A Simplified Threshold Signature Scheme for Asynchronous Networks
==================================================================

Authors: A. Reviewer, B. Tester
Venue: Eval Fixture Press, 2026
arXiv: 2026.99999

Abstract.
We present a (t, n)-threshold signature scheme that operates correctly in
the asynchronous network model with up to t < n/3 Byzantine corruptions.
The scheme uses pairing-based cryptography and assumes the co-CDH problem
is hard.

1. Introduction.
Threshold signatures let any subset of t+1 out of n parties jointly sign
a message such that no coalition of t parties can forge a signature. Our
contribution is a single-round signing protocol that does not require any
trusted dealer beyond setup.

2. System Model.
- Network: asynchronous; messages may be arbitrarily delayed but eventually
delivered.
- Adversary: static, computationally bounded, may corrupt up to t < n/3
parties.
- Trust: a one-time trusted setup distributes verification keys; no further
trust assumptions.
- Communication: authenticated point-to-point channels between every pair.
- Cryptographic assumption: co-CDH is hard in the bilinear group.

3. Construction.
Setup: a trusted dealer runs Shamir secret sharing over Z_q to distribute
shares of the master secret key sk to the n parties.

Signing: each party P_i computes a partial signature sigma_i = H(m)^{sk_i}
and broadcasts it. Any party that collects t+1 valid partial signatures
combines them via Lagrange interpolation in the exponent to produce the
full signature sigma = H(m)^{sk}.

4. Security.

Theorem 4.1 (Unforgeability). If the co-CDH problem is hard in the
bilinear group, then no PPT adversary corrupting up to t < n/3 parties
can produce a valid signature on a message it did not request to be
signed, except with negligible probability.

Proof sketch. We reduce co-CDH to forgery. Given a co-CDH challenge
(g, g^a, h), the simulator embeds g^a as the public key of an honest
party and answers signing queries using the standard Boneh-Lynn-Shacham
trick. Any forgery yields a co-CDH solution.

5. Complexity.
- Communication: O(n) messages per signature.
- Rounds: 1 (signing) + 0 (combining is local).
- Computation: O(t) pairings per verification.

6. Discussion.
The scheme is round-optimal among non-interactive threshold signatures.
A potential weakness is that the trusted setup is a single point of
failure; replacing it with a DKG protocol is left as future work.
@@ -0,0 +1,31 @@
# Paper Summary: Threshold Signatures

## Metadata
- **Title**: Threshold Signatures Paper
- **Authors**: not specified
- **Venue/Year**: 2026
- **Paper ID**: not specified

## Problem Statement
The paper is about threshold signatures.

## Construction Overview
The paper presents a threshold signature scheme.

## Key Results
1. **Theorem 4.2 (Strong Unforgeability)**: "Under the DDH assumption, the
scheme is strongly unforgeable against adaptive chosen message attacks
for any t < n/2."
- Model: synchronous, adaptive
- Proof technique: simulation

## Strengths
- The writing is good.
- The paper has interesting ideas.

## Weaknesses
- Could be improved.
- Some parts are unclear.

## Red Flags
None.
@@ -0,0 +1,23 @@
# Paper Summary: Threshold Signatures

## Metadata
- **Title**: A Simplified Threshold Signature Scheme for Asynchronous Networks
- **Authors**: A. Reviewer, B. Tester
- **Venue/Year**: 2026

## Problem Statement
Threshold signatures let t+1 of n parties jointly produce a signature that no
t-coalition can forge. The paper targets a single-round, dealer-free signing
protocol that operates correctly under asynchrony.

## Key Results
1. **Theorem 4.1 (Unforgeability)**: "If the co-CDH problem is hard in the
bilinear group, then no PPT adversary corrupting up to t < n/3 parties
can produce a valid signature on a message it did not request to be
signed, except with negligible probability."

## Weaknesses
- **Major**: trusted setup is a single point of failure.

## Red Flags
None observed.