feat: LLM-as-judge eval harness — gather/compile split, UTC hardening, leaderboard #50

Merged
VarunGitGood merged 4 commits into main from feat/evaluation-judge on Jun 4, 2026

@VarunGitGood
Owner

Summary

Finishes the evaluation-judge feature on this branch on top of #44 / #47:

  • Gather → compile split (issue #48, "ReAct loop: split evidence gathering from answer compilation"). The ReAct loop now only gathers evidence and exits via done_gathering; a separate compile-LLM call in repi/investigation/compiler.py turns the gathered evidence into a validated InvestigationAnswer. A deterministic synth is the last-resort fallback. repi/llm/json_utils.py is the new shared JSON parser used by both the loop and the eval judge.
  • Timezone hardening at the DB boundary. asyncpg/timestamptz interprets naive datetimes through the process's local timezone, which silently shifts every write and query window on a non-UTC host. New DateHandler.to_aware_utc() is now applied at every read/write site that crosses the asyncpg/SQLAlchemy boundary; LogChunk.timestamp_{start,end} are declared DateTime(timezone=True) to match the TIMESTAMPTZ columns. Seed scripts (eval/dataset_*/seed.py) also pin their anchor weekday to UTC so seed and resolver share one clock.
  • Judge hard-test (eval/check_judge.py). Canned gold + fail answers per dataset, scored against the judge to verify it lands in the right buckets (gold ≥ 0.8, fail ≤ 0.5) before we trust it to rank real models. On Mistral self-grade today: gold 1.00, fail ≤ 0.26 across all 3 datasets — judge is trusted.
  • Reason-then-score judge prompt. eval/judge.py system prompt now forces explanation before score, reducing post-hoc rationalisation.
  • Leaderboard table. New leaderboard table in db/schema.sql (one row per (run_id, dataset) with criteria JSONB + stats + judge metadata). eval/run_evals.py auto-writes every run; failures are logged and never block the eval.
  • UI: SSE stream now emits phase_change events, step kind, and ships stats in the done payload; the investigation page renders them.
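
The leaderboard bullet above says write failures are logged and never block the eval. A minimal sketch of that best-effort pattern follows; `persist_best_effort`, `failing_writer`, the logger name, and the row fields are hypothetical names chosen for illustration, not the code in eval/run_evals.py.

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable

logger = logging.getLogger("eval.leaderboard")  # hypothetical logger name

Row = dict[str, Any]
Writer = Callable[[Row], Awaitable[None]]


async def persist_best_effort(write_row: Writer, row: Row) -> bool:
    """Try to store one leaderboard row; log and swallow any failure."""
    try:
        await write_row(row)
    except Exception:
        # The eval result already exists, so a DB outage only costs this row.
        logger.exception(
            "leaderboard write failed (run_id=%s, dataset=%s)",
            row.get("run_id"),
            row.get("dataset"),
        )
        return False
    return True


async def failing_writer(row: Row) -> None:
    """Stand-in for a DB call against an unreachable database."""
    raise ConnectionError("database unreachable")


if __name__ == "__main__":
    demo_row = {"run_id": "run-1", "dataset": "dataset_1"}
    # The eval keeps going even though the write raised.
    print(asyncio.run(persist_best_effort(failing_writer, demo_row)))
```

Returning a boolean instead of re-raising lets the caller report the missed write in its status output without changing the eval's pass/fail result.
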

Test plan

  • uv run pytest tests/ -q (120 passed locally)
  • uv run python eval/check_judge.py --judge-provider mistral — all datasets ✓ (gold 1.00, fail ≤ 0.26)
  • uv run python eval/run_evals.py --dataset dataset_1 --judge-provider mistral — PASS (0.98)
  • uv run python eval/run_evals.py --dataset dataset_2 --judge-provider mistral — PASS (0.85)
  • uv run python eval/run_evals.py --dataset dataset_3 --judge-provider mistral — PASS (0.90)
  • Leaderboard rows persisted; criteria JSONB shape (5/6/6 entries) and stats payload verified via psql
  • Schema applied cleanly via make migrate; \d leaderboard shows all 5 indexes
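
The check_judge.py runs above rely on two thresholds (gold ≥ 0.8, fail ≤ 0.5). As a rough sketch of how such a bucket check could be expressed, assuming one gold and one fail score per dataset: `BucketCheck` and `judge_is_trusted` are hypothetical names, and the scores in the usage note are illustrative only.

```python
from dataclasses import dataclass

GOLD_MIN = 0.8  # threshold quoted in the PR description
FAIL_MAX = 0.5  # threshold quoted in the PR description


@dataclass
class BucketCheck:
    """Judge scores for the canned gold and fail answers of one dataset."""

    dataset: str
    gold_score: float
    fail_score: float

    @property
    def ok(self) -> bool:
        # Both canned answers must land in their expected bucket.
        return self.gold_score >= GOLD_MIN and self.fail_score <= FAIL_MAX


def judge_is_trusted(checks: list[BucketCheck]) -> bool:
    """Trust the judge only if every dataset passes; an empty run proves nothing."""
    return bool(checks) and all(check.ok for check in checks)
```

For example, `judge_is_trusted([BucketCheck("dataset_1", 1.0, 0.26)])` passes, while a gold score of 0.7 on any dataset fails the whole check.
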

- eval/check_judge.py: canned gold + fail answers per dataset, scored
  against the LLM judge to verify it lands in the right buckets
  (gold >= 0.8, fail <= 0.5) before we trust it to rank real models.
- eval/judge.py: judge prompt now forces explanation before score
  (reason-then-score) to reduce post-hoc rationalisation.
- eval/dataset_*/seed.py: pin anchor weekday to UTC so the seed and
  the resolver (which defaults to UTC) cannot drift by a day on a
  non-UTC host near midnight.
- db/schema.sql + eval/run_evals.py: new leaderboard table; every run
  now writes one row per (run_id, dataset) with criteria JSONB and
  stats. Best-effort: a DB failure never blocks the eval run.
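
The seed/resolver drift described in the seed.py bullet above can be reproduced with a fixed instant. This is a small sketch, not the seed script itself; `anchor_weekday` and the IST offset are assumptions made for the demonstration.

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=5, minutes=30))  # example non-UTC host offset


def anchor_weekday(now: datetime) -> int:
    """Weekday (Monday=0) of an aware datetime, evaluated on the UTC clock."""
    return now.astimezone(timezone.utc).weekday()


# 01:30 on Friday 5 June 2026 in IST is 20:00 on Thursday 4 June 2026 in UTC.
instant = datetime(2026, 6, 5, 1, 30, tzinfo=IST)

local_weekday = instant.weekday()      # what a host-local clock would report
utc_weekday = anchor_weekday(instant)  # what a UTC-defaulting resolver expects
```

Here `local_weekday` and `utc_weekday` differ by one day, which is the off-by-a-day window that pinning the anchor to UTC closes.
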
asyncpg's `timestamptz` codec interprets naive Python datetimes through the
process's *local* timezone, which silently shifts every write and query
window by the host offset on a non-UTC host (e.g. IST/+05:30). Internally
the project treats datetimes as naive UTC, so the conversion has to happen
at the DB boundary.

- core/dates.py: new `DateHandler.to_aware_utc()` helper.
- models/schema.py: declare `LogChunk.timestamp_{start,end}` as
  `DateTime(timezone=True)` to match the TIMESTAMPTZ DB columns.
- investigation/sweep.py, investigation/tools.py,
  retrieval/filter_builder.py, retrieval/pgvector_store.py:
  attach UTC at every read/write site that crosses the asyncpg/SQLAlchemy
  boundary.
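
A minimal sketch of what a helper like `DateHandler.to_aware_utc()` might do, assuming the naive-means-UTC convention stated above. The real helper in core/dates.py may differ; this standalone function only illustrates the idea.

```python
from datetime import datetime, timedelta, timezone


def to_aware_utc(value: datetime) -> datetime:
    """Return a tz-aware UTC datetime so the DB driver never guesses the zone."""
    if value.tzinfo is None:
        # Project convention per the PR: naive datetimes already mean UTC,
        # so attach the zone without shifting the wall-clock fields.
        return value.replace(tzinfo=timezone.utc)
    # Aware values are converted, which does shift the wall-clock fields.
    return value.astimezone(timezone.utc)


naive = datetime(2026, 6, 4, 12, 0)
ist = timezone(timedelta(hours=5, minutes=30))
aware_ist = datetime(2026, 6, 4, 17, 30, tzinfo=ist)

# Both inputs describe the same instant once normalised.
same_instant = to_aware_utc(naive) == to_aware_utc(aware_ist)
```

Using `replace` for naive input and `astimezone` for aware input matters: calling `astimezone` on a naive value would reinterpret it through the host's local timezone, which is the very shift the PR is removing.
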
The ReAct loop now gathers evidence and exits via `done_gathering`; a
separate compile-LLM call in repi/investigation/compiler.py turns the
gathered evidence into a validated InvestigationAnswer. This narrows
each prompt to a single job, cuts schema-violation churn from the
investigation loop, and gives us a clean place to plug a deterministic
"unable to determine" synth as a last-resort fallback.

- investigation/react_loop.py: huge refactor. The loop replaces submit_answer
  with done_gathering; legacy submit_answer is still recognised as a
  done-gathering signal (its payload is discarded). parse_llm_response
  moves to repi/llm/json_utils.py and is re-exported for back-compat.
- investigation/compiler.py (new): compile_answer (LLM, one validation
  retry), enforce_floors hook, and synthesize_answer (non-LLM fallback).
- investigation/schema.py: floors + supporting types for compile.
- llm/json_utils.py (new): shared parser used by both the loop and the
  eval judge.
- llm/adapters.py: minor adjustments for compile-call usage.
- api/investigate.py: SSE stream now emits phase_change events, includes
  step `kind`, and ships stats in the done payload.
- web/*: render phase_change + kind + stats on the investigation page.
- eval/criteria.py: criterion builder tweaks aligned with new schema.
- tests/llm/, tests/investigation/test_compiler.py,
  test_react_loop_gathering.py: coverage for the new modules.
- tests/eval/test_judge.py, test_react_loop_{ledger,parser,reflection}.py:
  updated for the new wiring.
- CLAUDE.md, .gitignore: reflect the gather→compile flow and ignore
  legacy bug.json / eval/results artefacts.
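
To illustrate the kind of work a shared parser such as llm/json_utils.py has to do, here is a hedged sketch of tolerant JSON extraction from an LLM reply. `extract_json_object` is a hypothetical name and the strategy (fenced blocks first, then any `{` in the reply) is an assumption, not the module's actual logic.

```python
import json
import re
from typing import Any, Optional

_FENCE = "`" * 3  # built at runtime so this example can live inside a fenced block
_FENCED_BLOCK = re.compile(_FENCE + r"(?:json)?\s*(.*?)" + _FENCE, re.DOTALL)


def extract_json_object(reply: str) -> Optional[dict[str, Any]]:
    """Return the first JSON object found in an LLM reply, or None."""
    # Prefer the contents of fenced blocks, then fall back to the whole reply.
    candidates = [match.group(1) for match in _FENCED_BLOCK.finditer(reply)]
    candidates.append(reply)
    decoder = json.JSONDecoder()
    for text in candidates:
        # Try each '{' as a start; raw_decode tolerates trailing prose.
        for start in (i for i, ch in enumerate(text) if ch == "{"):
            try:
                value, _end = decoder.raw_decode(text, start)
            except json.JSONDecodeError:
                continue
            if isinstance(value, dict):
                return value
    return None
```

Returning `None` instead of raising leaves the retry decision to the caller, which suits both a loop that re-prompts and a judge with a parse retry.
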
vercel Bot commented Jun 4, 2026

Project: repi. Deployment: Ready. Updated (UTC): Jun 4, 2026 5:55pm.

Copilot AI left a comment

Pull request overview

This PR completes the “LLM-as-judge” eval harness by splitting the investigation pipeline into evidence gathering vs answer compilation, hardening UTC handling at the DB boundary, adding a leaderboard persistence layer, and updating the UI/SSE stream to surface phases and telemetry.

Changes:

  • Split investigation into a gather-only ReAct loop plus a separate compile step (repi/investigation/compiler.py) that produces a validated InvestigationAnswer with deterministic fallback.
  • Harden timestamp handling for TIMESTAMPTZ columns by attaching UTC tzinfo at asyncpg/SQLAlchemy boundaries and aligning ORM schema with timezone-aware columns.
  • Expand eval harness: judge prompt updates + parse retry, canned judge calibration script, leaderboard table + persistence, and UI phase/stats rendering via SSE.
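
The gather-only loop from the first bullet can be pictured with a simplified, synchronous sketch. The names `gather`, `GatherResult`, and the `search_logs` tool are hypothetical, and the real loop in repi/investigation/react_loop.py is async and does considerably more (reflection, stall detection, stats).

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# The legacy action name is still accepted as a done signal, per the PR.
DONE_ACTIONS = {"done_gathering", "submit_answer"}

Evidence = list[dict[str, Any]]


@dataclass
class GatherResult:
    evidence: Evidence = field(default_factory=list)
    steps: int = 0
    finished_by_signal: bool = False


def gather(
    next_action: Callable[[Evidence], dict[str, Any]],
    run_tool: Callable[[str, dict[str, Any]], Evidence],
    max_steps: int = 8,
) -> GatherResult:
    """Collect evidence until the model signals it is done or the budget ends."""
    result = GatherResult()
    for _ in range(max_steps):
        action = next_action(result.evidence)
        result.steps += 1
        if action.get("tool") in DONE_ACTIONS:
            # Any payload on the done signal is ignored here; producing the
            # answer is the job of the separate compile step.
            result.finished_by_signal = True
            break
        result.evidence.extend(run_tool(action["tool"], action.get("args", {})))
    return result
```

The loop returns only evidence and bookkeeping, never an answer, so the compile step can be tested and retried on its own.
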

Reviewed changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `web/lib/sse.ts` | Add phase/stats to SSE hook; add step kind typing. |
| `web/components/investigation-step.tsx` | Render special step kinds (reflection/signal/compile) with distinct UI. |
| `web/app/investigations/[id]/page.tsx` | Show phase indicator (“gathering/compiling”) and display run stats when available. |
| `tests/llm/test_adapters.py` | Add tests for typed adapter error classification (429 vs other 4xx) and Mistral retry semantics. |
| `tests/llm/__init__.py` | Package marker for llm tests. |
| `tests/investigation/test_react_loop_submit_answer.py` | Remove legacy submit-answer finalize tests (loop no longer finalizes). |
| `tests/investigation/test_react_loop_reflection.py` | Update reflection tests for done_gathering + compile step and separate reflection budget. |
| `tests/investigation/test_react_loop_parser.py` | Switch parser tests to shared repi.llm.json_utils helpers. |
| `tests/investigation/test_react_loop_ledger.py` | Update ledger tests to gather/compile split and done_gathering. |
| `tests/investigation/test_react_loop_gathering.py` | New tests for gather-only behaviors (done signal, stall detection, null-action guard, stats). |
| `tests/investigation/test_compiler.py` | New tests for compiler LLM call + validation retry + deterministic synthesis + floor enforcement. |
| `tests/eval/test_judge.py` | Refactor judge tests for shared parser, parse-retry, and advisory precheck. |
| `repi/retrieval/pgvector_store.py` | Attach UTC tzinfo before DB writes to TIMESTAMPTZ timestamp columns. |
| `repi/retrieval/filter_builder.py` | Attach UTC tzinfo for time window filters at DB boundary. |
| `repi/models/schema.py` | Declare timestamp_{start,end} as timezone-aware SQLAlchemy DateTime(timezone=True). |
| `repi/llm/json_utils.py` | New shared robust JSON extraction/parser for LLM replies. |
| `repi/llm/adapters.py` | Add _check_4xx, LLMRateLimitError, LLMBadRequestError; improve Mistral 429 handling. |
| `repi/investigation/tools.py` | Ensure ISO timestamps become tz-aware UTC before hitting asyncpg. |
| `repi/investigation/sweep.py` | Normalize sweep time window to tz-aware UTC at DB boundary. |
| `repi/investigation/schema.py` | Add server-side enforce_floors confidence/consistency enforcement. |
| `repi/investigation/react_loop.py` | Convert loop to gather-only with done_gathering, compile handoff, new stats, SSE phase hooks. |
| `repi/investigation/compiler.py` | New compile phase LLM call with validation retry + deterministic synth fallback. |
| `repi/core/dates.py` | Add DateHandler.to_aware_utc() for asyncpg/SQLAlchemy TIMESTAMPTZ boundary correctness. |
| `repi/api/investigate.py` | SSE now emits step kind, phase_change events, and includes stats in done payload for live runs. |
| `eval/run_evals.py` | Judge provider auto-selection (avoid self-grading), optional --out, leaderboard persistence, richer status. |
| `eval/judge.py` | Reason-then-score prompt, shared parser usage, parse retry, deterministic scoring for some criteria. |
| `eval/dataset_3_jwt_key_rotation_noise/seed.py` | Anchor seed weekday computation to UTC. |
| `eval/dataset_2_insufficient_logging/seed.py` | Anchor seed weekday computation to UTC. |
| `eval/dataset_1_cascading_inventory_migration/seed.py` | Anchor seed weekday computation to UTC. |
| `eval/criteria.py` | Support building criteria subsets (for deterministic+LLM split scoring). |
| `eval/check_judge.py` | New judge hard-test with canned gold/fail answers and thresholds. |
| `db/schema.sql` | Add leaderboard table + indexes. |
| `CLAUDE.md` | Update docs: eval output behavior, judge provider selection, gather/compile split. |
| `bug.json` | Remove legacy bug output artifact from repo. |
| `.gitignore` | Ignore legacy bug.json and optional eval output artifacts. |

Comment thread repi/investigation/react_loop.py Outdated
Comment on lines +517 to +522
if self.store:
    await self.store.increment_llm_calls(investigation_obj.id)

if self.store:
    await self.store.increment_llm_calls(investigation_obj.id)

Comment on lines +266 to +286
evidence_ids = {c.get("chunk_id") for c in evidence if c.get("chunk_id")}
recent = recent_thoughts or []
validation_errors: Optional[list[str]] = None
last_parsed: dict = {}

for attempt in range(1, 3):
    messages = _build_compile_messages(
        query=query,
        resolved_intent=resolved_intent,
        evidence=evidence,
        tool_ledger=tool_ledger,
        recent_thoughts=recent,
        validation_errors=validation_errors,
        known_services=known_services,
    )
    try:
        raw = await llm.complete(messages, max_tokens=max_tokens, temperature=0.0)
    except Exception as exc:
        logger.warning("Compiler LLM call attempt %d raised: %s", attempt, exc)
        break

Comment on lines +314 to +319
return CompileResult(
    answer=adjusted,
    source="llm_invalid",
    attempts=2,
    floor_adjustments=adjustments,
)
Comment on lines +327 to +332
return CompileResult(
    answer=synthesized,
    source="deterministic",
    attempts=2,
    floor_adjustments=[],
)
Comment thread repi/models/schema.py Outdated
Comment on lines +5 to +7
 from sqlmodel import SQLModel, Field, Column, JSON
 from pgvector.sqlalchemy import Vector
-from sqlalchemy import TEXT, ARRAY, Index, String, Column
+from sqlalchemy import TEXT, ARRAY, Index, String, Column, DateTime
- react_loop: collapse the duplicate increment_llm_calls call at the
  gathering LLM site into a single guarded block, matching the
  reflection-path pattern. Removes double-counting of total_llm_calls.
- compiler: track attempts actually made and report it on the
  llm_invalid and deterministic fallback CompileResults (and in the
  user-visible gaps message). Previously these hard-coded attempts=2
  even when the first attempt raised, lying to the leaderboard stats.
- models/schema: drop the unused `Column` re-export from `sqlmodel`
  so it stops shadowing the SQLAlchemy `Column` that sa_column uses.
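
The attempts fix in the compiler bullet above comes down to counting calls as they happen instead of reporting a constant. A simplified, synchronous sketch follows, under the assumption of a two-attempt budget. `compile_with_retry` and `CompileOutcome` are hypothetical stand-ins for the PR's `compile_answer` and `CompileResult`, and the `llm_invalid` path is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CompileOutcome:
    answer: dict
    source: str    # "llm" or "deterministic" in this sketch
    attempts: int  # number of LLM calls actually made


def compile_with_retry(
    call_llm: Callable[[Optional[list[str]]], dict],
    validate: Callable[[dict], list[str]],
    max_attempts: int = 2,
) -> CompileOutcome:
    """Call the LLM, retry once on validation errors, else fall back."""
    attempts_made = 0
    errors: Optional[list[str]] = None
    for _ in range(max_attempts):
        attempts_made += 1
        try:
            candidate = call_llm(errors)  # errors feed the retry prompt
        except Exception:
            break  # call failed: stop retrying but keep the true count
        errors = validate(candidate)
        if not errors:
            return CompileOutcome(candidate, "llm", attempts_made)
    fallback = {
        "root_cause": "unable to determine",
        "gaps": [f"compile failed after {attempts_made} attempt(s)"],
    }
    return CompileOutcome(fallback, "deterministic", attempts_made)
```

If the first call raises, the outcome reports one attempt rather than two, so downstream stats reflect what actually ran.
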
VarunGitGood merged commit 4ce11d9 into main on Jun 4, 2026
3 checks passed