Skip to content

Add layered eval system with Claude CLI as judge#58

Merged
SebastianElvis merged 1 commit into
mainfrom
SebastianElvis/skill-evals-plan
May 2, 2026
Merged

Add layered eval system with Claude CLI as judge#58
SebastianElvis merged 1 commit into
mainfrom
SebastianElvis/skill-evals-plan

Conversation

@SebastianElvis
Copy link
Copy Markdown
Owner

Summary

Three-layer evaluation following Anthropic's Demystifying Evals for AI Agents. The judge is the local claude CLI — no API key, runs on the maintainer's subscription.

  • L1 Structural (evals/graders/): code-based, deterministic, runs in CI. Required sections, lengths, broken-ref detection, keep-or-discard cycle invariant.
  • L2 LLM judge (evals/judge/judge.py): wraps claude -p with --tools "", structured-output JSON schema, pinned claude-opus-4-7, isolated trials, anti-confabulation grounding check.
  • Orchestrator (evals/run_evals.py): stages variants in clean evals/runs/<run-id>/, emits md + json reports tagged with CLI version.

Fixtures pair a gold-standard reference with planted negatives — one for L1 (drops sections) and one for L2 (fabricated theorem statements, generic content) — so a permissive grader fails CI as visibly as a missed regression.

analyze-paper/cryptography-sample is the seed fixture. pass^3 self-consistency: reference 2/2/2 every run; quality-negative 0/0/0 every run.

Codex review (independent) flagged and we fixed: P1 threshold bypass (rubric passing_score was being ignored), unicode read failures, loose cycle-decision regex, missing prompt-cache salt, missing CLI version logging. Also caught a real analyze-paper/SKILL.md contradiction (sections "proportional" vs system model "no missing dimensions") and a judge prompt bug (completeness writing summaries instead of verbatim heading quotes).

CLAUDE.md and README.md updated; CLAUDE.md links the Demystifying Evals guide as the canonical reference.

Test plan

  • pytest tests/ — 50/50 passing (28 existing + 11 new L1 + 10 search-paper integration + 1 new anchored-regex test)
  • python3 -m evals.run_evals --layer structural — clean reference passes, structural-negative trips has_sections exactly as designed
  • python3 -m evals.run_evals --layer all --skill analyze-paper — reference passes all dimensions; quality-negative fails all dimensions; pass^3 = 1.0 across 3 trials
  • CI evals-fast job added (mirrors tests/test_skill_outputs.py)

🤖 Generated with Claude Code

Three-layer evaluation following Anthropic's *Demystifying Evals for AI
Agents*:

- L1 structural graders (`evals/graders/`) wired into CI via the new
  `evals-fast` pytest job — required sections, lengths, broken-ref
  detection, and the keep-or-discard cycle invariant.
- L2 LLM judge (`evals/judge/judge.py`) wraps `claude -p` with
  `--tools ""`, structured-output JSON schema, pinned `claude-opus-4-7`,
  per-trial salt, and isolated runs. Uses the user's local CLI auth —
  no API key required. Per-dimension prompts under `evals/judge/prompts/`
  (groundedness, specificity, completeness) with verbatim-evidence
  requirement, "unknown" escape hatch, and a post-hoc grounding check.
- Orchestrator (`evals/run_evals.py`) stages each variant in a clean
  `evals/runs/<run-id>/`, applies the layers selected on the CLI, and
  emits markdown + JSON reports tagged with the CLI version.

Fixtures pair a gold-standard reference with planted negatives — one
targeting L1 (drops sections) and one targeting L2 (fabricated theorem
statements, generic content) — so a permissive grader fails CI as
visibly as a missed regression. `analyze-paper/cryptography-sample` is
the seed fixture; pass^3 self-consistency: reference 2/2/2 every run,
quality-negative 0/0/0 every run.

Resolves a real `analyze-paper/SKILL.md` contradiction surfaced by the
eval design: "sections proportional to what the paper warrants" vs
"system model is complete — no missing dimensions". Reworded so the
proportionality wins and partial fills are forbidden.

CLAUDE.md and README.md updated; CLAUDE.md links the *Demystifying
Evals* guide as the canonical reference for eval design.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@SebastianElvis SebastianElvis merged commit c4c543b into main May 2, 2026
4 checks passed
SebastianElvis added a commit that referenced this pull request May 2, 2026
## Summary
- Document the layered L1/L2/L3 eval system shipped in #58 (graders,
judge, rubrics, fixtures, orchestrator) — previously the ROADMAP only
mentioned \`evals.json\`.
- Add missing \`[x]\` entries for \`/clarify-goal\` and \`/brainstorm\`
(already built); update the design-decisions skill list accordingly.
- Replace stale references: \`cross-verify\` → \`/critique\`; "11 skill
directories" → 10.
- Split the unchecked end-to-end test task into actionable follow-ups:
source the 3 paper PDFs, expand L1+L2 fixture coverage beyond
\`analyze-paper\`, calibrate judge prompts against \`evals/golden/\`.

No status changes to H2–H6: unchecked items there (LaTeX synthesis,
multi-model backends beyond Codex, \`search-dblp/scholar/venue\`,
evidence taxonomy, proactive reformulation) genuinely remain undone.

## Test plan
- [x] \`git diff origin/main...\` reviewed — only \`dev/ROADMAP.md\`
changes
- [ ] N/A: docs-only change, no code paths affected

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant