Add layered eval system with Claude CLI as judge #58
Merged
Three-layer evaluation following Anthropic's *Demystifying Evals for AI Agents*:

- L1 structural graders (`evals/graders/`), wired into CI via the new `evals-fast` pytest job: required sections, lengths, broken-ref detection, and the keep-or-discard cycle invariant.
- L2 LLM judge (`evals/judge/judge.py`) wraps `claude -p` with `--tools ""`, structured-output JSON schema, pinned `claude-opus-4-7`, per-trial salt, and isolated runs. Uses the user's local CLI auth, so no API key is required. Per-dimension prompts under `evals/judge/prompts/` (groundedness, specificity, completeness) with verbatim-evidence requirement, "unknown" escape hatch, and a post-hoc grounding check.
- Orchestrator (`evals/run_evals.py`) stages each variant in a clean `evals/runs/<run-id>/`, applies the layers selected on the CLI, and emits markdown + JSON reports tagged with the CLI version.

Fixtures pair a gold-standard reference with planted negatives, one targeting L1 (drops sections) and one targeting L2 (fabricated theorem statements, generic content), so a permissive grader fails CI as visibly as a missed regression. `analyze-paper/cryptography-sample` is the seed fixture; pass^3 self-consistency: reference 2/2/2 every run, quality-negative 0/0/0 every run.

Resolves a real `analyze-paper/SKILL.md` contradiction surfaced by the eval design: "sections proportional to what the paper warrants" vs "system model is complete — no missing dimensions". Reworded so the proportionality wins and partial fills are forbidden.

CLAUDE.md and README.md updated; CLAUDE.md links the *Demystifying Evals* guide as the canonical reference for eval design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
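For illustration, a required-sections check of the kind the L1 layer performs could look like the sketch below. This is not the code in `evals/graders/`: the function name `missing_sections` and the example section names are hypothetical, and it assumes the skill output is markdown with ATX (`#`) headings.

```python
import re


def missing_sections(markdown: str, required: list[str]) -> list[str]:
    """Return the required section titles that have no matching heading.

    Hypothetical helper: matching is case-insensitive on the full heading text.
    """
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, re.MULTILINE)
    }
    return [title for title in required if title.lower() not in headings]


# Hypothetical skill output and required-section list.
sample_output = "# Summary\n\n## System Model\nSome text.\n"
print(missing_sections(sample_output, ["Summary", "System Model", "Limitations"]))
```

A deterministic check like this is cheap enough to run on every CI build, which is why it sits in the fast layer rather than behind the judge.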
SebastianElvis added a commit that referenced this pull request May 2, 2026
## Summary

- Document the layered L1/L2/L3 eval system shipped in #58 (graders, judge, rubrics, fixtures, orchestrator); previously the ROADMAP only mentioned `evals.json`.
- Add missing `[x]` entries for `/clarify-goal` and `/brainstorm` (already built); update the design-decisions skill list accordingly.
- Replace stale references: `cross-verify` → `/critique`; "11 skill directories" → 10.
- Split the unchecked end-to-end test task into actionable follow-ups: source the 3 paper PDFs, expand L1+L2 fixture coverage beyond `analyze-paper`, calibrate judge prompts against `evals/golden/`.

No status changes to H2–H6: unchecked items there (LaTeX synthesis, multi-model backends beyond Codex, `search-dblp/scholar/venue`, evidence taxonomy, proactive reformulation) genuinely remain undone.

## Test plan

- [x] `git diff origin/main...` reviewed: only `dev/ROADMAP.md` changes
- [ ] N/A: docs-only change, no code paths affected

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
## Summary

Three-layer evaluation following Anthropic's *Demystifying Evals for AI Agents*. The judge is the local `claude` CLI: no API key, runs on the maintainer's subscription.

- L1 structural graders (`evals/graders/`): code-based, deterministic, runs in CI. Required sections, lengths, broken-ref detection, keep-or-discard cycle invariant.
- L2 LLM judge (`evals/judge/judge.py`): wraps `claude -p` with `--tools ""`, structured-output JSON schema, pinned `claude-opus-4-7`, isolated trials, anti-confabulation grounding check.
- Orchestrator (`evals/run_evals.py`): stages variants in clean `evals/runs/<run-id>/`, emits md + json reports tagged with CLI version.

Fixtures pair a gold-standard reference with planted negatives, one for L1 (drops sections) and one for L2 (fabricated theorem statements, generic content), so a permissive grader fails CI as visibly as a missed regression. `analyze-paper/cryptography-sample` is the seed fixture. pass^3 self-consistency: reference 2/2/2 every run; quality-negative 0/0/0 every run.

An independent Codex review flagged the following, all now fixed: P1 threshold bypass (rubric `passing_score` was being ignored), unicode read failures, loose cycle-decision regex, missing prompt-cache salt, missing CLI version logging. It also caught a real `analyze-paper/SKILL.md` contradiction (sections "proportional" vs system model "no missing dimensions") and a judge prompt bug (completeness writing summaries instead of verbatim heading quotes).

CLAUDE.md and README.md updated; CLAUDE.md links the *Demystifying Evals* guide as the canonical reference.
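A minimal sketch of a post-hoc grounding check, assuming the judge returns its evidence as a list of quoted strings and that only whitespace differences are tolerated. `ungrounded_quotes` is a hypothetical name and this is not the implementation in `evals/judge/judge.py`; the sample document and quotes are invented for the example.

```python
import re


def _collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def ungrounded_quotes(evidence: list[str], document: str) -> list[str]:
    """Return the evidence quotes that do not appear verbatim in the document.

    Empty quotes are treated as ungrounded, since they support nothing.
    """
    haystack = _collapse_whitespace(document)
    failures = []
    for quote in evidence:
        needle = _collapse_whitespace(quote)
        if not needle or needle not in haystack:
            failures.append(quote)
    return failures


# Invented example: the second quote was never in the graded document.
graded_document = "## Theorem 1\nThe scheme is IND-CPA\nsecure under DDH."
judge_evidence = [
    "The scheme is IND-CPA secure under DDH.",
    "The scheme is IND-CCA2 secure.",
]
print(ungrounded_quotes(judge_evidence, graded_document))
```

Any non-empty result would mean the judge cited text that is not in the artifact, so the verdict for that dimension should not be trusted as-is.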
## Test plan

- `pytest tests/`: 50/50 passing (28 existing + 11 new L1 + 10 search-paper integration + 1 new anchored-regex test)
- `python3 -m evals.run_evals --layer structural`: clean reference passes, structural-negative trips `has_sections` exactly as designed
- `python3 -m evals.run_evals --layer all --skill analyze-paper`: reference passes all dimensions; quality-negative fails all dimensions; pass^3 = 1.0 across 3 trials
- `evals-fast` job added (mirrors `tests/test_skill_outputs.py`)

🤖 Generated with Claude Code
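The pass^3 figure above can be read as "all three trials pass". A sketch of that computation for a single fixture follows, under stated assumptions: a 0 to 2 score per dimension (inferred from the 2/2/2 and 0/0/0 results), a hypothetical `passing_score` of 2, and hypothetical function names that are not taken from `evals/run_evals.py`.

```python
from typing import Sequence

Scores = dict[str, int]  # dimension name -> judge score for one trial


def trial_passes(scores: Scores, passing_score: int) -> bool:
    """A trial passes only if every dimension meets the rubric threshold."""
    return all(score >= passing_score for score in scores.values())


def pass_hat_k(trials: Sequence[Scores], passing_score: int) -> float:
    """Return 1.0 if all k trials pass, else 0.0 (single-fixture pass^k)."""
    if not trials:
        raise ValueError("need at least one trial")
    return 1.0 if all(trial_passes(t, passing_score) for t in trials) else 0.0


def is_self_consistent(trials: Sequence[Scores]) -> bool:
    """True when every trial produced identical per-dimension scores."""
    return all(t == trials[0] for t in trials)


dimensions = ("groundedness", "specificity", "completeness")
reference_trials = [{d: 2 for d in dimensions} for _ in range(3)]
negative_trials = [{d: 0 for d in dimensions} for _ in range(3)]

print(pass_hat_k(reference_trials, passing_score=2), is_self_consistent(reference_trials))
print(pass_hat_k(negative_trials, passing_score=2), is_self_consistent(negative_trials))
```

Passing the threshold in explicitly keeps the rubric's `passing_score` from being silently bypassed, which is the class of bug the review flagged.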