feat: replace custom eval graders with LLM-as-judge scoring#47

Merged: VarunGitGood merged 2 commits into main from feat/evaluation-judge on May 31, 2026

Conversation

@VarunGitGood
Owner

Summary

  • Removes ~230 lines of brittle hand-coded graders (grade_dataset_1/2/3, _mentions(), regex alternates tables) from eval/run_evals.py
  • Replaces them with a single LLMJudge class that scores investigation answers against criteria derived from expected.json
  • Produces per-criterion numeric scores (0.0–1.0) with explanations instead of binary pass/fail
  • Adds --judge-provider, --judge-model, --judge-api-key CLI args (defaults to configured provider)
  • Pass threshold: 0.8 aggregate score

New files

| File | Purpose |
| --- | --- |
| eval/results.py | CriterionScore and JudgeResult pydantic models |
| eval/criteria.py | build_criteria(): extracts scoring dimensions from expected.json keys generically |
| eval/judge.py | LLMJudge class, judge system prompt, response parser, deterministic_precheck() |
| tests/eval/test_criteria.py | 13 tests: criteria extraction from all 3 dataset shapes |
| tests/eval/test_judge.py | 11 tests: precheck, response parsing, mock LLM scoring |
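
As a rough illustration of the result models listed above, the sketch below uses stdlib dataclasses as a stand-in for the PR's Pydantic models. The `name`, `score`, and `explanation` fields appear in the reviewed code; the `JudgeResult` fields (`criteria`, `aggregate`, `passed`) and the range check are assumptions, not the contents of eval/results.py.

```python
from dataclasses import dataclass, field


@dataclass
class CriterionScore:
    """One scored dimension (dataclass stand-in for the Pydantic model)."""

    name: str
    score: float  # expected to lie in [0.0, 1.0]
    explanation: str = ""

    def __post_init__(self) -> None:
        # Assumed validation: reject scores outside the documented range.
        if not 0.0 <= self.score <= 1.0:
            raise ValueError(f"score for {self.name!r} must be in [0.0, 1.0]")


@dataclass
class JudgeResult:
    """Overall judge output; these field names are hypothetical."""

    criteria: list[CriterionScore] = field(default_factory=list)
    aggregate: float = 0.0
    passed: bool = False
```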

Verified

  • Dataset 1 scored 0.94 with Mistral judge (trigger: 1.0, root cause: 0.9, chain: 0.8, red herrings: 1.0, confidence: 1.0)
  • 98/98 tests pass (uv run pytest tests/ -v)
  • No per-dataset grader code — new datasets only need expected.json, no code changes
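
The reported 0.94 for Dataset 1 is consistent with an unweighted mean of the five per-criterion scores, which is how the reviewed parser computes its aggregate. A quick check, using the scores quoted above and the 0.8 threshold from the summary:

```python
# Per-criterion scores reported for Dataset 1 in this PR.
scores = {
    "trigger": 1.0,
    "root cause": 0.9,
    "chain": 0.8,
    "red herrings": 1.0,
    "confidence": 1.0,
}

PASS_THRESHOLD = 0.8  # pass threshold stated in the PR summary

# Unweighted mean across criteria.
aggregate = sum(scores.values()) / len(scores)
passed = aggregate >= PASS_THRESHOLD

print(f"{aggregate:.2f}", passed)  # prints: 0.94 True
```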

Closes #44

Test plan

  • uv run pytest tests/eval/ -v — 24 tests pass
  • uv run pytest tests/ -v — full suite 98 tests pass
  • uv run python eval/run_evals.py --dataset dataset_1 — produces numeric scores
  • Run all 3 datasets with a non-rate-limited provider to validate end-to-end
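
For readers unfamiliar with the new flags, here is a minimal argparse sketch of how they could be declared. The flag names come from this PR; the defaults, help strings, and the `build_parser` helper are assumptions, and the actual run_evals.py may wire them differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags named in this PR."""
    parser = argparse.ArgumentParser(description="Run evals with an LLM judge (sketch).")
    parser.add_argument("--dataset", default=None, help="Dataset name, e.g. dataset_1")
    # None means "fall back to the configured provider", per the PR summary.
    parser.add_argument("--judge-provider", default=None, help="Judge LLM provider")
    parser.add_argument("--judge-model", default=None, help="Judge model name")
    parser.add_argument("--judge-api-key", default=None, help="API key for the judge provider")
    return parser


# argparse turns "--judge-model" into the attribute "judge_model";
# vars() gives a dict, matching the args.get("judge_model") style in the diff.
args = vars(build_parser().parse_args(["--dataset", "dataset_1", "--judge-model", "example-model"]))
print(args["dataset"], args["judge_model"], args["judge_provider"])
```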

Remove ~230 lines of brittle regex/substring graders (grade_dataset_1/2/3,
_mentions, alternates tables) and replace with a single LLMJudge that scores
investigation answers against criteria derived from expected.json.

New modules: eval/results.py (score models), eval/criteria.py (criteria
extraction from expected.json keys), eval/judge.py (LLMJudge class + prompt).
Adds --judge-provider, --judge-model, --judge-api-key CLI args to run_evals.py.
Pass threshold: 0.8 aggregate score.
@vercel

vercel Bot commented May 31, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| repi | Ready | Preview, Comment | May 31, 2026 7:35am |


Copilot AI left a comment

Pull request overview

This PR replaces the per-dataset, hand-coded evaluation graders in eval/run_evals.py with an LLM-as-judge approach that scores investigation outputs against criteria derived from each dataset’s expected.json, producing per-criterion numeric scores and an aggregate pass/fail based on a 0.8 threshold.

Changes:

  • Adds a criteria extraction layer (eval/criteria.py) and Pydantic result models (eval/results.py) to drive judge prompts and structured outputs.
  • Introduces an LLMJudge scorer (eval/judge.py) with deterministic prechecks and judge-response parsing.
  • Updates the eval runner (eval/run_evals.py) to run the judge, persist numeric results to bug.json, and adds CLI flags for judge provider/model/key; adds test coverage and pytest config.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| eval/run_evals.py | Replaces dataset-specific graders with LLM judge scoring + CLI configuration and result output. |
| eval/judge.py | Implements judge prompt, deterministic precheck, and parsing of judge outputs into structured results. |
| eval/criteria.py | Extracts criterion names and criteria text from expected.json to drive judging. |
| eval/results.py | Adds Pydantic models for per-criterion scores and overall judge results. |
| tests/eval/test_criteria.py | Adds tests for criteria extraction across dataset shapes. |
| tests/eval/test_judge.py | Adds tests for precheck, judge parsing behavior, and judge scoring flow. |
| pyproject.toml | Configures pytest asyncio strict mode and sets pythonpath for tests. |

Comment thread eval/run_evals.py Outdated
Comment on lines +21 to +23
from repi.core.container import get_container
from eval.judge import LLMJudge, deterministic_precheck, PASS_THRESHOLD
from eval.results import JudgeResult
Comment thread eval/run_evals.py Outdated
Comment on lines +96 to +112
if args.get("judge_model"):
    from repi.llm.adapters import (
        OpenAIProvider, AnthropicProvider, MistralProvider,
        GeminiProvider, OllamaProvider,
    )
    model = args["judge_model"]
    if isinstance(llm, OpenAIProvider):
        llm = OpenAIProvider(api_key=llm._api_key, model=model)
    elif isinstance(llm, AnthropicProvider):
        llm = AnthropicProvider(api_key=llm._api_key, model=model)
    elif isinstance(llm, MistralProvider):
        llm = MistralProvider(api_key=llm._api_key, model=model)
    elif isinstance(llm, GeminiProvider):
        llm = GeminiProvider(api_key=llm._api_key, model=model)
    elif isinstance(llm, OllamaProvider):
        llm = OllamaProvider(base_url=llm._base_url, model=model)
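
The snippet above rebuilds each provider by reading private attributes such as `_api_key`, and the follow-up commit in this PR says keys are now read from config settings instead. One possible shape for that is a small factory table keyed by provider name. This is only a sketch: `Settings`, `FakeOpenAIProvider`, `FakeOllamaProvider`, `PROVIDER_FACTORIES`, and `build_judge_llm` are all hypothetical stand-ins, not names from repi.

```python
from dataclasses import dataclass


@dataclass
class FakeOpenAIProvider:  # hypothetical stand-in for a real adapter
    api_key: str
    model: str


@dataclass
class FakeOllamaProvider:  # hypothetical stand-in for a real adapter
    base_url: str
    model: str


@dataclass
class Settings:  # hypothetical config object holding credentials
    openai_api_key: str = ""
    ollama_base_url: str = "http://localhost:11434"


# Each factory takes (settings, model) and returns a configured provider,
# so no private adapter attributes need to be read.
PROVIDER_FACTORIES = {
    "openai": lambda s, model: FakeOpenAIProvider(api_key=s.openai_api_key, model=model),
    "ollama": lambda s, model: FakeOllamaProvider(base_url=s.ollama_base_url, model=model),
}


def build_judge_llm(provider: str, model: str, settings: Settings):
    """Build a judge provider from settings; raise on unknown provider names."""
    try:
        factory = PROVIDER_FACTORIES[provider]
    except KeyError:
        raise ValueError(f"Unknown judge provider: {provider!r}") from None
    return factory(settings, model)
```

Adding a provider then means adding one table entry rather than another `elif` branch.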

Comment thread eval/criteria.py Outdated
Comment thread eval/criteria.py
Comment on lines +130 to +141
def _red_herring_criterion(ea: dict) -> str:
    ruled_out = ea.get("ruled_out_hypotheses_must_include")
    if not ruled_out:
        return ""

    lines = ["## Criterion: red_herring_handling"]
    lines.append("The ruled_out_hypotheses must address these red herrings:")
    for item in ruled_out:
        about = item.get("hypothesis_about", "?")
        rationale = item.get("rationale", "")
        lines.append(f" - {about}: {rationale}")
    return "\n".join(lines)
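
The follow-up commit in this PR notes that dataset 2 uses `ruled_out_hypotheses_must_consider` rather than `ruled_out_hypotheses_must_include`. A minimal sketch of accepting either key is shown below. It assumes both keys hold the same list-of-dicts shape (`hypothesis_about`, `rationale`), which the PR does not confirm, and the function name is illustrative.

```python
def red_herring_criterion(ea: dict) -> str:
    """Render the red-herring criterion from either accepted key (sketch)."""
    # Prefer the "must_include" key; fall back to the dataset 2 variant.
    ruled_out = (
        ea.get("ruled_out_hypotheses_must_include")
        or ea.get("ruled_out_hypotheses_must_consider")
    )
    if not ruled_out:
        return ""  # no red herrings defined, so no criterion is emitted

    lines = [
        "## Criterion: red_herring_handling",
        "The ruled_out_hypotheses must address these red herrings:",
    ]
    for item in ruled_out:
        about = item.get("hypothesis_about", "?")
        rationale = item.get("rationale", "")
        lines.append(f" - {about}: {rationale}")
    return "\n".join(lines)
```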
Comment thread eval/criteria.py Outdated
Comment on lines +13 to +23
# Maps expected.json keys to the criterion name the judge will score.
# Order here determines evaluation order in the prompt.
_CRITERION_KEYS = [
    "trigger_identification",
    "root_cause_accuracy",
    "propagation_chain",
    "red_herring_handling",
    "confidence_calibration",
    "gap_awareness",
    "hallucination_avoidance",
]
Comment thread eval/judge.py Outdated
Comment on lines +93 to +113
parsed = json.loads(cleaned)
scores_raw = parsed.get("scores", [])

criteria: list[CriterionScore] = []
for entry in scores_raw:
    criteria.append(CriterionScore(
        name=entry["name"],
        score=max(0.0, min(1.0, float(entry["score"]))),
        explanation=entry.get("explanation", ""),
    ))

scored_names = {c.name for c in criteria}
for name in criterion_names:
    if name not in scored_names:
        criteria.append(CriterionScore(
            name=name,
            score=0.0,
            explanation="Judge did not return a score for this criterion.",
        ))

aggregate = sum(c.score for c in criteria) / len(criteria) if criteria else 0.0
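
The reviewed snippet raises on malformed JSON and lets unrequested criteria affect the aggregate; the follow-up commit in this PR says both were addressed. The sketch below shows one way such a parser could behave. It is not the code in eval/judge.py: `parse_scores` is a hypothetical name, `CriterionScore` is a dataclass stand-in, and skipping non-finite scores is an added assumption.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class CriterionScore:  # dataclass stand-in for the PR's Pydantic model
    name: str
    score: float
    explanation: str = ""


def parse_scores(raw: str, criterion_names: list[str]) -> tuple[list[CriterionScore], float]:
    """Parse judge output, tolerating bad JSON and ignoring unrequested criteria."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        parsed = None  # treat unparseable output as "no scores returned"
    entries = parsed.get("scores", []) if isinstance(parsed, dict) else []
    if not isinstance(entries, list):
        entries = []

    by_name: dict[str, CriterionScore] = {}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        name = entry.get("name")
        # Keep only requested criteria, first occurrence wins.
        if name not in criterion_names or name in by_name:
            continue
        try:
            score = float(entry.get("score"))
        except (TypeError, ValueError):
            continue
        if not math.isfinite(score):
            continue  # NaN/inf would otherwise slip through the clamp
        clamped = max(0.0, min(1.0, score))
        by_name[name] = CriterionScore(name, clamped, str(entry.get("explanation", "")))

    criteria = []
    for name in criterion_names:
        if name in by_name:
            criteria.append(by_name[name])
        else:
            criteria.append(CriterionScore(
                name, 0.0, "Judge did not return a score for this criterion."
            ))

    aggregate = sum(c.score for c in criteria) / len(criteria) if criteria else 0.0
    return criteria, aggregate
```

Because the aggregate is computed only over `criterion_names`, a judge that invents extra criteria cannot raise or lower the result.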
- criteria.py: handle ruled_out_hypotheses_must_consider (dataset 2),
  only add confidence_calibration when present, remove dead _CRITERION_KEYS
- judge.py: graceful JSON parse failure, filter extra criteria from
  aggregate (only score requested criterion_names)
- run_evals.py: remove unused JudgeResult import, read API keys from
  config settings instead of private adapter attributes
- tests: add coverage for invalid JSON, extra criteria filtering,
  dataset 2 ruled_out_hypotheses_must_consider
@VarunGitGood VarunGitGood merged commit 3b1f74f into main May 31, 2026
3 checks passed
Development

Successfully merging this pull request may close these issues.

E4: Replace custom eval graders with LLM-as-judge scoring

2 participants