feat: LLM-as-judge eval harness — gather/compile split, UTC hardening, leaderboard#50
Merged
- eval/check_judge.py: canned gold + fail answers per dataset, scored against the LLM judge to verify it lands in the right buckets (gold >= 0.8, fail <= 0.5) before we trust it to rank real models.
- eval/judge.py: judge prompt now forces explanation before score (reason-then-score) to reduce post-hoc rationalisation.
- eval/dataset_*/seed.py: pin anchor weekday to UTC so the seed and the resolver (which defaults to UTC) cannot drift by a day on a non-UTC host near midnight.
- db/schema.sql + eval/run_evals.py: new leaderboard table; every run now writes one row per (run_id, dataset) with criteria JSONB and stats. Best-effort: a DB failure never blocks the eval run.
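The bucket check described for eval/check_judge.py can be sketched roughly as follows. This is a minimal illustration, not the script itself: the `CannedResult` type and both function names are hypothetical, and only the two thresholds (gold >= 0.8, fail <= 0.5) come from the commit message.

```python
from dataclasses import dataclass

GOLD_MIN = 0.8  # gold answers must score at least this (from the commit message)
FAIL_MAX = 0.5  # fail answers must score at most this (from the commit message)


@dataclass
class CannedResult:
    """Hypothetical container for one dataset's judge scores."""
    dataset: str
    gold_score: float
    fail_score: float


def in_buckets(result: CannedResult) -> bool:
    """True when both canned answers land in their expected bucket."""
    return result.gold_score >= GOLD_MIN and result.fail_score <= FAIL_MAX


def miscalibrated(results: list[CannedResult]) -> list[str]:
    """Names of datasets where the judge scored a canned answer out of bucket."""
    return [r.dataset for r in results if not in_buckets(r)]
```

An empty list from `miscalibrated` would mean the judge separated gold from fail answers on every dataset, which is the precondition the commit sets before ranking real models.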
asyncpg's `timestamptz` codec interprets naive Python datetimes through the
process's *local* timezone, which silently shifts every write and query
window by the host offset on a non-UTC host (e.g. IST/+05:30). Internally
the project treats datetimes as naive UTC, so the conversion has to happen
at the DB boundary.
- core/dates.py: new `DateHandler.to_aware_utc()` helper.
- models/schema.py: declare `LogChunk.timestamp_{start,end}` as
`DateTime(timezone=True)` to match the TIMESTAMPTZ DB columns.
- investigation/sweep.py, investigation/tools.py,
retrieval/filter_builder.py, retrieval/pgvector_store.py:
attach UTC at every read/write site that crosses the asyncpg/SQLAlchemy
boundary.
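A helper along these lines would cover the boundary conversion described above. It is a sketch under the stated convention (naive datetimes are already UTC), written as a free function; the actual `DateHandler.to_aware_utc()` signature and its handling of `None` or already-aware values are assumptions here.

```python
from datetime import datetime, timezone
from typing import Optional


def to_aware_utc(dt: Optional[datetime]) -> Optional[datetime]:
    """Hypothetical stand-in for DateHandler.to_aware_utc()."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        # Project convention: naive values are already UTC, so attach tzinfo
        # without shifting the wall-clock time.
        return dt.replace(tzinfo=timezone.utc)
    # Already aware: convert to UTC so every value crossing the boundary agrees.
    return dt.astimezone(timezone.utc)
```

Because the value is tz-aware by the time it reaches the driver, the host's local timezone no longer plays a part in how it is encoded.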
The ReAct loop now gathers evidence and exits via `done_gathering`; a
separate compile-LLM call in repi/investigation/compiler.py turns the
gathered evidence into a validated InvestigationAnswer. This narrows
each prompt to a single job, cuts schema-violation churn from the
investigation loop, and gives us a clean place to plug a deterministic
"unable to determine" synth as a last-resort fallback.
- investigation/react_loop.py: major refactor. The loop replaces
submit_answer with done_gathering; legacy submit_answer is still
recognised as a done-gathering signal (its payload is discarded).
parse_llm_response moves to repi/llm/json_utils.py and is re-exported
for back-compat.
- investigation/compiler.py (new): compile_answer (LLM, one validation
retry), enforce_floors hook, and synthesize_answer (non-LLM fallback).
- investigation/schema.py: floors + supporting types for compile.
- llm/json_utils.py (new): shared parser used by both the loop and the
eval judge.
- llm/adapters.py: minor adjustments for compile-call usage.
- api/investigate.py: SSE stream now emits phase_change events, includes
step `kind`, and ships stats in the done payload.
- web/*: render phase_change + kind + stats on the investigation page.
- eval/criteria.py: criterion builder tweaks aligned with new schema.
- tests/llm/, tests/investigation/test_compiler.py,
test_react_loop_gathering.py: coverage for the new modules.
- tests/eval/test_judge.py, test_react_loop_{ledger,parser,reflection}.py:
updated for the new wiring.
- CLAUDE.md, .gitignore: reflect the gather→compile flow and ignore
legacy bug.json / eval/results artefacts.
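The exit condition of the gather-only loop, including the legacy `submit_answer` alias, might look something like this. It is a simplified sketch: the `"action"` and `"evidence"` keys and both function names are assumptions for illustration, not the shapes used in react_loop.py.

```python
from typing import Any

# done_gathering is the new exit signal; submit_answer is the legacy name that
# is still recognised, with its payload ignored.
DONE_ACTIONS = {"done_gathering", "submit_answer"}


def is_done_signal(step: dict[str, Any]) -> bool:
    """True when a parsed LLM step asks to stop gathering (hypothetical keys)."""
    return step.get("action") in DONE_ACTIONS


def gather(steps: list[dict[str, Any]]) -> list[str]:
    """Collect evidence ids until a done signal; the compile phase runs later."""
    evidence: list[str] = []
    for step in steps:
        if is_done_signal(step):
            break  # any answer payload on the step is discarded here
        evidence.extend(step.get("evidence", []))
    return evidence
```

Treating the legacy action as a plain stop signal keeps older prompts working while leaving answer construction entirely to the compile step.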
Pull request overview
This PR completes the “LLM-as-judge” eval harness by splitting the investigation pipeline into evidence gathering vs answer compilation, hardening UTC handling at the DB boundary, adding a leaderboard persistence layer, and updating the UI/SSE stream to surface phases and telemetry.
Changes:
- Split investigation into a gather-only ReAct loop plus a separate compile step (`repi/investigation/compiler.py`) that produces a validated `InvestigationAnswer` with deterministic fallback.
- Harden timestamp handling for `TIMESTAMPTZ` columns by attaching UTC tzinfo at asyncpg/SQLAlchemy boundaries and aligning ORM schema with timezone-aware columns.
- Expand eval harness: judge prompt updates + parse retry, canned judge calibration script, leaderboard table + persistence, and UI phase/stats rendering via SSE.
Reviewed changes
Copilot reviewed 33 out of 35 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| web/lib/sse.ts | Add phase/stats to SSE hook; add step kind typing. |
| web/components/investigation-step.tsx | Render special step kinds (reflection/signal/compile) with distinct UI. |
| web/app/investigations/[id]/page.tsx | Show phase indicator (“gathering/compiling”) and display run stats when available. |
| tests/llm/test_adapters.py | Add tests for typed adapter error classification (429 vs other 4xx) and Mistral retry semantics. |
| `tests/llm/__init__.py` | Package marker for llm tests. |
| tests/investigation/test_react_loop_submit_answer.py | Remove legacy submit-answer finalize tests (loop no longer finalizes). |
| tests/investigation/test_react_loop_reflection.py | Update reflection tests for done_gathering + compile step and separate reflection budget. |
| tests/investigation/test_react_loop_parser.py | Switch parser tests to shared repi.llm.json_utils helpers. |
| tests/investigation/test_react_loop_ledger.py | Update ledger tests to gather/compile split and done_gathering. |
| tests/investigation/test_react_loop_gathering.py | New tests for gather-only behaviors (done signal, stall detection, null-action guard, stats). |
| tests/investigation/test_compiler.py | New tests for compiler LLM call + validation retry + deterministic synthesis + floor enforcement. |
| tests/eval/test_judge.py | Refactor judge tests for shared parser, parse-retry, and advisory precheck. |
| repi/retrieval/pgvector_store.py | Attach UTC tzinfo before DB writes to TIMESTAMPTZ timestamp columns. |
| repi/retrieval/filter_builder.py | Attach UTC tzinfo for time window filters at DB boundary. |
| repi/models/schema.py | Declare timestamp_{start,end} as timezone-aware SQLAlchemy DateTime(timezone=True). |
| repi/llm/json_utils.py | New shared robust JSON extraction/parser for LLM replies. |
| repi/llm/adapters.py | Add _check_4xx, LLMRateLimitError, LLMBadRequestError; improve Mistral 429 handling. |
| repi/investigation/tools.py | Ensure ISO timestamps become tz-aware UTC before hitting asyncpg. |
| repi/investigation/sweep.py | Normalize sweep time window to tz-aware UTC at DB boundary. |
| repi/investigation/schema.py | Add server-side enforce_floors confidence/consistency enforcement. |
| repi/investigation/react_loop.py | Convert loop to gather-only with done_gathering, compile handoff, new stats, SSE phase hooks. |
| repi/investigation/compiler.py | New compile phase LLM call with validation retry + deterministic synth fallback. |
| repi/core/dates.py | Add DateHandler.to_aware_utc() for asyncpg/SQLAlchemy TIMESTAMPTZ boundary correctness. |
| repi/api/investigate.py | SSE now emits step kind, phase_change events, and includes stats in done payload for live runs. |
| eval/run_evals.py | Judge provider auto-selection (avoid self-grading), optional --out, leaderboard persistence, richer status. |
| eval/judge.py | Reason-then-score prompt, shared parser usage, parse retry, deterministic scoring for some criteria. |
| eval/dataset_3_jwt_key_rotation_noise/seed.py | Anchor seed weekday computation to UTC. |
| eval/dataset_2_insufficient_logging/seed.py | Anchor seed weekday computation to UTC. |
| eval/dataset_1_cascading_inventory_migration/seed.py | Anchor seed weekday computation to UTC. |
| eval/criteria.py | Support building criteria subsets (for deterministic+LLM split scoring). |
| eval/check_judge.py | New judge hard-test with canned gold/fail answers and thresholds. |
| db/schema.sql | Add leaderboard table + indexes. |
| CLAUDE.md | Update docs: eval output behavior, judge provider selection, gather/compile split. |
| bug.json | Remove legacy bug output artifact from repo. |
| .gitignore | Ignore legacy bug.json and optional eval output artifacts. |
Comment on lines +517 to +522

    if self.store:
        await self.store.increment_llm_calls(investigation_obj.id)

    if self.store:
        await self.store.increment_llm_calls(investigation_obj.id)

Comment on lines +266 to +286

    evidence_ids = {c.get("chunk_id") for c in evidence if c.get("chunk_id")}
    recent = recent_thoughts or []
    validation_errors: Optional[list[str]] = None
    last_parsed: dict = {}

    for attempt in range(1, 3):
        messages = _build_compile_messages(
            query=query,
            resolved_intent=resolved_intent,
            evidence=evidence,
            tool_ledger=tool_ledger,
            recent_thoughts=recent,
            validation_errors=validation_errors,
            known_services=known_services,
        )
        try:
            raw = await llm.complete(messages, max_tokens=max_tokens, temperature=0.0)
        except Exception as exc:
            logger.warning("Compiler LLM call attempt %d raised: %s", attempt, exc)
            break

Comment on lines +314 to +319

    return CompileResult(
        answer=adjusted,
        source="llm_invalid",
        attempts=2,
        floor_adjustments=adjustments,
    )

Comment on lines +327 to +332

    return CompileResult(
        answer=synthesized,
        source="deterministic",
        attempts=2,
        floor_adjustments=[],
    )

Comment on lines +5 to +7

     from sqlmodel import SQLModel, Field, Column, JSON
     from pgvector.sqlalchemy import Vector
    -from sqlalchemy import TEXT, ARRAY, Index, String, Column
    +from sqlalchemy import TEXT, ARRAY, Index, String, Column, DateTime

- react_loop: collapse the duplicate increment_llm_calls call at the gathering LLM site into a single guarded block, matching the reflection-path pattern. Removes double-counting of total_llm_calls.
- compiler: track attempts actually made and report it on the llm_invalid and deterministic fallback CompileResults (and in the user-visible gaps message). Previously these hard-coded attempts=2 even when the first attempt raised, lying to the leaderboard stats.
- models/schema: drop the unused `Column` re-export from `sqlmodel` so it stops shadowing the SQLAlchemy `Column` that sa_column uses.
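The attempts fix in the compiler can be illustrated with a cut-down, synchronous version of the retry loop. This is a sketch only: `CompileOutcome`, `compile_with_retry`, and the three callables are hypothetical stand-ins, and the real `llm_invalid` path (a floor-adjusted invalid answer) is folded into the deterministic fallback for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_ATTEMPTS = 2  # one initial call plus one validation retry


@dataclass
class CompileOutcome:
    """Hypothetical, trimmed-down analogue of CompileResult."""
    answer: dict
    source: str    # "llm" or "deterministic"
    attempts: int  # LLM calls actually made, not the configured maximum


def compile_with_retry(
    call_llm: Callable[[Optional[list[str]]], dict],
    validate: Callable[[dict], list[str]],
    synthesize: Callable[[], dict],
) -> CompileOutcome:
    attempts = 0
    errors: Optional[list[str]] = None
    for _ in range(MAX_ATTEMPTS):
        attempts += 1
        try:
            # Previous validation errors are fed back on the retry.
            candidate = call_llm(errors)
        except Exception:
            break  # the call itself failed: stop and fall back
        errors = validate(candidate)
        if not errors:
            return CompileOutcome(candidate, "llm", attempts)
    # Report the real count, so a first-call failure shows attempts=1.
    return CompileOutcome(synthesize(), "deterministic", attempts)
```

Counting inside the loop, rather than hard-coding the maximum on the fallback paths, is what keeps the reported number honest when the first call raises.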
Summary
Finishes the evaluation-judge feature on this branch on top of #44 / #47:
- The ReAct loop now gathers evidence and exits via `done_gathering`; a separate compile-LLM call in `repi/investigation/compiler.py` turns the gathered evidence into a validated `InvestigationAnswer`. A deterministic synth is the last-resort fallback. `repi/llm/json_utils.py` is the new shared JSON parser used by both the loop and the eval judge.
- asyncpg/`timestamptz` interprets naive datetimes through the process's local timezone, which silently shifts every write and query window on a non-UTC host. New `DateHandler.to_aware_utc()` is now applied at every read/write site that crosses the asyncpg/SQLAlchemy boundary; `LogChunk.timestamp_{start,end}` are declared `DateTime(timezone=True)` to match the `TIMESTAMPTZ` columns. Seed scripts (`eval/dataset_*/seed.py`) also pin their anchor weekday to UTC so seed and resolver share one clock.
- Judge hard-test (`eval/check_judge.py`). Canned gold + fail answers per dataset, scored against the judge to verify it lands in the right buckets (gold ≥ 0.8, fail ≤ 0.5) before we trust it to rank real models. On Mistral self-grade today: gold 1.00, fail ≤ 0.26 across all 3 datasets, so the judge is trusted.
- `eval/judge.py` system prompt now forces explanation before score, reducing post-hoc rationalisation.
- `leaderboard` table in `db/schema.sql` (one row per `(run_id, dataset)` with criteria JSONB + stats + judge metadata). `eval/run_evals.py` auto-writes every run; failures are logged and never block the eval.
- The SSE stream now emits `phase_change` events, includes step `kind`, and ships stats in the `done` payload; the investigation page renders them.

Test plan
- `uv run pytest tests/ -q`: 120 passed (locally)
- `uv run python eval/check_judge.py --judge-provider mistral`: all datasets ✓ (gold 1.00, fail ≤ 0.26)
- `uv run python eval/run_evals.py --dataset dataset_1 --judge-provider mistral`: PASS (0.98)
- `uv run python eval/run_evals.py --dataset dataset_2 --judge-provider mistral`: PASS (0.85)
- `uv run python eval/run_evals.py --dataset dataset_3 --judge-provider mistral`: PASS (0.90)
- `make migrate`; `\d leaderboard` shows all 5 indexes
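The "failures are logged and never block the eval" behaviour of the leaderboard write could be wrapped roughly like this. The sketch is synchronous for brevity and everything in it is hypothetical: `write_best_effort`, the `write_row` callable, and the row keys stand in for whatever `eval/run_evals.py` actually does.

```python
import logging
from typing import Callable

logger = logging.getLogger("eval.leaderboard")


def write_best_effort(write_row: Callable[[dict], None], row: dict) -> bool:
    """Try to persist one leaderboard row; never raise into the eval run."""
    try:
        write_row(row)
        return True
    except Exception as exc:
        # Log and carry on: the eval result stays valid even if the DB is down.
        logger.warning(
            "leaderboard write failed for run_id=%s dataset=%s: %s",
            row.get("run_id"), row.get("dataset"), exc,
        )
        return False
```

Returning a boolean instead of raising lets the caller note in its status output that persistence was skipped without changing the pass/fail result of the eval.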