Add layered eval system with Claude CLI as judge by SebastianElvis · Pull Request #58 · SebastianElvis/reaper

SebastianElvis · 2026-05-02T04:35:06Z

Summary

Three-layer evaluation following Anthropic's Demystifying Evals for AI Agents. The judge is the local claude CLI — no API key, runs on the maintainer's subscription.

L1 Structural (evals/graders/): code-based, deterministic, runs in CI. Required sections, lengths, broken-ref detection, keep-or-discard cycle invariant.
L2 LLM judge (evals/judge/judge.py): wraps claude -p with --tools "", structured-output JSON schema, pinned claude-opus-4-7, isolated trials, anti-confabulation grounding check.
Orchestrator (evals/run_evals.py): stages variants in clean evals/runs/<run-id>/, emits md + json reports tagged with CLI version.
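
For readers unfamiliar with the anti-confabulation grounding check, here is a minimal sketch of the idea: every evidence quote the judge returns must appear verbatim in the artifact it graded. The verdict shape (`score`, `evidence`) and the function names are assumptions made for illustration, not the actual schema in `evals/judge/judge.py`.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip()


def ungrounded_quotes(verdict: dict, artifact: str) -> list[str]:
    """Return the evidence quotes that do NOT appear verbatim in the artifact.

    `verdict["evidence"]` is a hypothetical field holding the judge's quotes.
    An empty quote counts as ungrounded, since it proves nothing.
    """
    haystack = normalize(artifact)
    missing = []
    for quote in verdict.get("evidence", []):
        needle = normalize(quote)
        if not needle or needle not in haystack:
            missing.append(quote)
    return missing


artifact = "## Theorem 1\nThe scheme is EUF-CMA secure under the DL assumption."
verdict = {
    "score": 2,
    "evidence": [
        "The scheme is EUF-CMA secure",   # present in the artifact
        "The scheme is IND-CCA2 secure",  # confabulated by the judge
    ],
}

# A non-empty result means the judge cited text that is not in the artifact,
# so the verdict should be rejected or downgraded.
print(ungrounded_quotes(verdict, artifact))
```

Running the check after the judge call, rather than trusting the prompt alone, keeps the enforcement deterministic even though the judge itself is not.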

Fixtures pair a gold-standard reference with planted negatives — one for L1 (drops sections) and one for L2 (fabricated theorem statements, generic content) — so a permissive grader fails CI as visibly as a missed regression.

analyze-paper/cryptography-sample is the seed fixture. pass^3 self-consistency: reference 2/2/2 every run; quality-negative 0/0/0 every run.
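
As a rough illustration of the pass^k self-consistency figure, the sketch below treats a variant as consistent only if every one of its k trials passes. It assumes each trial yields one integer score per judge dimension and that a trial passes when every dimension meets the rubric threshold; the data layout and names are hypothetical, not the orchestrator's actual report format.

```python
def trial_passes(scores: dict[str, int], passing_score: int) -> bool:
    """A trial passes only if every dimension meets the threshold."""
    return all(score >= passing_score for score in scores.values())


def pass_hat_k(trials_by_variant: dict[str, list[dict[str, int]]],
               passing_score: int) -> float:
    """Fraction of variants for which ALL k trials pass (pass^k)."""
    if not trials_by_variant:
        return 0.0
    consistent = sum(
        # An empty trial list is never counted as consistent.
        bool(trials) and all(trial_passes(t, passing_score) for t in trials)
        for trials in trials_by_variant.values()
    )
    return consistent / len(trials_by_variant)


# Hypothetical scores on a 0-2 scale, three isolated trials per variant.
good = {"groundedness": 2, "specificity": 2, "completeness": 2}
bad = {"groundedness": 0, "specificity": 0, "completeness": 0}

reference_only = {"reference": [good, good, good]}
negative_only = {"quality-negative": [bad, bad, bad]}

print(pass_hat_k(reference_only, passing_score=2))
print(pass_hat_k(negative_only, passing_score=2))
```

Unlike pass@k, which rewards a single lucky success, this metric drops as soon as one trial disagrees, which is why it suits a judge that is expected to be stable.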

An independent Codex review flagged the following issues, all now fixed: a P1 threshold bypass (the rubric's passing_score was being ignored), unicode read failures, a loose cycle-decision regex, a missing prompt-cache salt, and missing CLI version logging. Also caught: a real analyze-paper/SKILL.md contradiction (sections "proportional" vs system model "no missing dimensions") and a judge prompt bug (the completeness prompt was writing summaries instead of verbatim heading quotes).
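
Two of these fixes are easy to picture in code. The sketch below is illustrative only: the `Decision: keep|discard` line format, the rubric dict shape, and both function names are assumptions, not the repository's actual graders.

```python
import re

# Anchored pattern: the whole line must be a decision. A loose pattern such as
# r"keep|discard" would also match prose like "we might keep this idea".
# The "Decision:" marker is a hypothetical format chosen for this example.
DECISION_RE = re.compile(
    r"^Decision:[ \t]*(keep|discard)[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def cycle_decisions(log: str) -> list[str]:
    """Extract keep/discard decisions, one per fully matching line."""
    return [m.group(1).lower() for m in DECISION_RE.finditer(log)]


def meets_rubric(score: int, rubric: dict) -> bool:
    """Compare against the rubric's own threshold.

    Indexing with rubric["passing_score"] (rather than falling back to a
    hard-coded default) raises KeyError when the field is missing, so a
    malformed rubric cannot silently bypass the threshold.
    """
    return score >= rubric["passing_score"]


log = (
    "We might keep this idea for later.\n"
    "Decision: discard\n"
    "Decision: keep going\n"
    "decision: KEEP\n"
)
print(cycle_decisions(log))
print(meets_rubric(1, {"passing_score": 2}))
```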

CLAUDE.md and README.md updated; CLAUDE.md links the Demystifying Evals guide as the canonical reference.

Test plan

pytest tests/ — 50/50 passing (28 existing + 11 new L1 + 10 search-paper integration + 1 new anchored-regex test)
python3 -m evals.run_evals --layer structural — clean reference passes, structural-negative trips has_sections exactly as designed
python3 -m evals.run_evals --layer all --skill analyze-paper — reference passes all dimensions; quality-negative fails all dimensions; pass^3 = 1.0 across 3 trials
CI evals-fast job added (mirrors tests/test_skill_outputs.py)
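
To make the structural run concrete, here is a minimal sketch of a required-sections grader exercised against a reference and a planted negative. The section names and fixture text are invented for the example; only the idea that the negative must trip `has_sections` comes from this PR.

```python
import re

# Hypothetical required sections; the real list lives with the graders.
REQUIRED_SECTIONS = ["Summary", "System Model", "Main Results"]

HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def has_sections(markdown: str, required: list[str]) -> list[str]:
    """Return the required section titles with no matching markdown heading."""
    headings = {m.group(1).strip().lower() for m in HEADING_RE.finditer(markdown)}
    return [title for title in required if title.lower() not in headings]


reference = (
    "# Paper\n"
    "## Summary\ntext\n"
    "## System Model\ntext\n"
    "## Main Results\ntext\n"
)
# Planted negative: identical to the reference except one section is dropped.
structural_negative = (
    "# Paper\n"
    "## Summary\ntext\n"
    "## Main Results\ntext\n"
)

# The grader is only trustworthy if it passes the reference AND fails the
# negative; a grader that passes both is too permissive.
print(has_sections(reference, REQUIRED_SECTIONS))
print(has_sections(structural_negative, REQUIRED_SECTIONS))
```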

🤖 Generated with Claude Code

Three-layer evaluation following Anthropic's *Demystifying Evals for AI Agents*:

- L1 structural graders (`evals/graders/`) wired into CI via the new `evals-fast` pytest job: required sections, lengths, broken-ref detection, and the keep-or-discard cycle invariant.
- L2 LLM judge (`evals/judge/judge.py`) wraps `claude -p` with `--tools ""`, structured-output JSON schema, pinned `claude-opus-4-7`, per-trial salt, and isolated runs. Uses the user's local CLI auth, so no API key is required. Per-dimension prompts under `evals/judge/prompts/` (groundedness, specificity, completeness) with verbatim-evidence requirement, "unknown" escape hatch, and a post-hoc grounding check.
- Orchestrator (`evals/run_evals.py`) stages each variant in a clean `evals/runs/<run-id>/`, applies the layers selected on the CLI, and emits markdown + JSON reports tagged with the CLI version.

Fixtures pair a gold-standard reference with planted negatives, one targeting L1 (drops sections) and one targeting L2 (fabricated theorem statements, generic content), so a permissive grader fails CI as visibly as a missed regression. `analyze-paper/cryptography-sample` is the seed fixture; pass^3 self-consistency: reference 2/2/2 every run, quality-negative 0/0/0 every run.

Resolves a real `analyze-paper/SKILL.md` contradiction surfaced by the eval design: "sections proportional to what the paper warrants" vs "system model is complete — no missing dimensions". Reworded so the proportionality wins and partial fills are forbidden.

CLAUDE.md and README.md updated; CLAUDE.md links the *Demystifying Evals* guide as the canonical reference for eval design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

## Summary

- Document the layered L1/L2/L3 eval system shipped in #58 (graders, judge, rubrics, fixtures, orchestrator); previously the ROADMAP only mentioned `evals.json`.
- Add missing `[x]` entries for `/clarify-goal` and `/brainstorm` (already built); update the design-decisions skill list accordingly.
- Replace stale references: `cross-verify` → `/critique`; "11 skill directories" → 10.
- Split the unchecked end-to-end test task into actionable follow-ups: source the 3 paper PDFs, expand L1+L2 fixture coverage beyond `analyze-paper`, calibrate judge prompts against `evals/golden/`.

No status changes to H2–H6: unchecked items there (LaTeX synthesis, multi-model backends beyond Codex, `search-dblp/scholar/venue`, evidence taxonomy, proactive reformulation) genuinely remain undone.

## Test plan

- [x] `git diff origin/main...` reviewed: only `dev/ROADMAP.md` changes
- [ ] N/A: docs-only change, no code paths affected

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

SebastianElvis merged commit c4c543b into main May 2, 2026
4 checks passed

SebastianElvis mentioned this pull request May 2, 2026

Revise ROADMAP to reflect current implementation status #59

Merged

2 tasks

SebastianElvis merged 1 commit into main from SebastianElvis/skill-evals-plan
