feat: LLM-as-judge eval harness — gather/compile split, UTC hardening, leaderboard by VarunGitGood · Pull Request #50 · VarunGitGood/repi

VarunGitGood · 2026-06-04T13:26:15Z

Summary

Finishes the evaluation-judge feature on this branch on top of #44 / #47:

Gather → compile split (issue #48, "ReAct loop: split evidence gathering from answer compilation"). The ReAct loop now only gathers evidence and exits via done_gathering; a separate compile-LLM call in repi/investigation/compiler.py turns the gathered evidence into a validated InvestigationAnswer. A deterministic synth is the last-resort fallback. repi/llm/json_utils.py is the new shared JSON parser used by both the loop and the eval judge.
Timezone hardening at the DB boundary. asyncpg/timestamptz interprets naive datetimes through the process's local timezone, which silently shifts every write and query window on a non-UTC host. New DateHandler.to_aware_utc() is now applied at every read/write site that crosses the asyncpg/SQLAlchemy boundary; LogChunk.timestamp_{start,end} are declared DateTime(timezone=True) to match the TIMESTAMPTZ columns. Seed scripts (eval/dataset_*/seed.py) also pin their anchor weekday to UTC so seed and resolver share one clock.
Judge hard-test (eval/check_judge.py). Canned gold + fail answers per dataset, scored against the judge to verify it lands in the right buckets (gold ≥ 0.8, fail ≤ 0.5) before we trust it to rank real models. On Mistral self-grade today: gold 1.00, fail ≤ 0.26 across all 3 datasets — judge is trusted.
Reason-then-score judge prompt. eval/judge.py system prompt now forces explanation before score, reducing post-hoc rationalisation.
Leaderboard table. New leaderboard table in db/schema.sql (one row per (run_id, dataset) with criteria JSONB + stats + judge metadata). eval/run_evals.py auto-writes every run; failures are logged and never block the eval.
UI: SSE stream now emits phase_change events, step kind, and ships stats in the done payload; the investigation page renders them.

Test plan

uv run pytest tests/ -q — 120 passed (locally)
uv run python eval/check_judge.py --judge-provider mistral — all datasets ✓ (gold 1.00, fail ≤ 0.26)
uv run python eval/run_evals.py --dataset dataset_1 --judge-provider mistral — PASS (0.98)
uv run python eval/run_evals.py --dataset dataset_2 --judge-provider mistral — PASS (0.85)
uv run python eval/run_evals.py --dataset dataset_3 --judge-provider mistral — PASS (0.90)
Leaderboard rows persisted; criteria JSONB shape (5/6/6 entries) and stats payload verified via psql
Schema applied cleanly via make migrate; \d leaderboard shows all 5 indexes

- eval/check_judge.py: canned gold + fail answers per dataset, scored against the LLM judge to verify it lands in the right buckets (gold >= 0.8, fail <= 0.5) before we trust it to rank real models.
- eval/judge.py: judge prompt now forces explanation before score (reason-then-score) to reduce post-hoc rationalisation.
- eval/dataset_*/seed.py: pin anchor weekday to UTC so the seed and the resolver (which defaults to UTC) cannot drift by a day on a non-UTC host near midnight.
- db/schema.sql + eval/run_evals.py: new leaderboard table; every run now writes one row per (run_id, dataset) with criteria JSONB and stats. Best-effort: a DB failure never blocks the eval run.
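The bucket check described for eval/check_judge.py reduces to two threshold comparisons per dataset. The sketch below uses the thresholds quoted above (gold >= 0.8, fail <= 0.5); the function names and the shape of the `scores` mapping are assumptions made for illustration, not the script's actual interface.

```python
GOLD_MIN = 0.8  # a canned gold answer must score at least this
FAIL_MAX = 0.5  # a canned fail answer must score at most this


def judge_is_calibrated(gold_score: float, fail_score: float) -> bool:
    """True when the judge places both canned answers in the expected buckets."""
    return gold_score >= GOLD_MIN and fail_score <= FAIL_MAX


def check_all(scores: dict[str, tuple[float, float]]) -> dict[str, bool]:
    """Map each dataset name to whether its (gold, fail) pair passes."""
    return {name: judge_is_calibrated(gold, fail) for name, (gold, fail) in scores.items()}
```

With the numbers reported in the test plan (gold 1.00, fail at most 0.26), every dataset passes; a judge that scored a canned fail answer at 0.6 would be rejected before it ranks any real model.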

asyncpg's `timestamptz` codec interprets naive Python datetimes through the process's *local* timezone, which silently shifts every write and query window by the host offset on a non-UTC host (e.g. IST/+05:30). Internally the project treats datetimes as naive UTC, so the conversion has to happen at the DB boundary.

- core/dates.py: new `DateHandler.to_aware_utc()` helper.
- models/schema.py: declare `LogChunk.timestamp_{start,end}` as `DateTime(timezone=True)` to match the TIMESTAMPTZ DB columns.
- investigation/sweep.py, investigation/tools.py, retrieval/filter_builder.py, retrieval/pgvector_store.py: attach UTC at every read/write site that crosses the asyncpg/SQLAlchemy boundary.
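A helper of this kind can be sketched with the standard library alone. The real `DateHandler.to_aware_utc()` signature is not shown in this PR, so the free function below is an assumption; in particular, converting already-aware inputs to UTC (rather than rejecting them) is a choice made for the example.

```python
from datetime import datetime, timedelta, timezone


def to_aware_utc(dt: datetime) -> datetime:
    """Return a timezone-aware UTC datetime.

    Naive inputs are assumed to already be in UTC (the project's internal
    convention) and only get tzinfo attached; aware inputs are converted.
    """
    if dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# A naive value keeps its wall-clock fields and gains tzinfo=UTC.
naive = datetime(2026, 6, 4, 13, 26, 15)
aware_from_naive = to_aware_utc(naive)

# An aware IST value (+05:30) is shifted to the same instant in UTC.
ist = timezone(timedelta(hours=5, minutes=30))
aware_from_ist = to_aware_utc(datetime(2026, 6, 4, 18, 56, 15, tzinfo=ist))
```

Because the value handed to the driver is always aware, the host's local timezone no longer influences how the timestamp is encoded.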

The ReAct loop now gathers evidence and exits via `done_gathering`; a separate compile-LLM call in repi/investigation/compiler.py turns the gathered evidence into a validated InvestigationAnswer. This narrows each prompt to a single job, cuts schema-violation churn from the investigation loop, and gives us a clean place to plug a deterministic "unable to determine" synth as a last-resort fallback.

- investigation/react_loop.py: huge refactor. The loop replaces submit_answer with done_gathering; legacy submit_answer is still recognised as a done-gathering signal (its payload is discarded). parse_llm_response moves to repi/llm/json_utils.py and is re-exported for back-compat.
- investigation/compiler.py (new): compile_answer (LLM, one validation retry), enforce_floors hook, and synthesize_answer (non-LLM fallback).
- investigation/schema.py: floors + supporting types for compile.
- llm/json_utils.py (new): shared parser used by both the loop and the eval judge.
- llm/adapters.py: minor adjustments for compile-call usage.
- api/investigate.py: SSE stream now emits phase_change events, includes step `kind`, and ships stats in the done payload.
- web/*: render phase_change + kind + stats on the investigation page.
- eval/criteria.py: criterion builder tweaks aligned with new schema.
- tests/llm/, tests/investigation/test_compiler.py, test_react_loop_gathering.py: coverage for the new modules.
- tests/eval/test_judge.py, test_react_loop_{ledger,parser,reflection}.py: updated for the new wiring.
- CLAUDE.md, .gitignore: reflect the gather→compile flow and ignore legacy bug.json / eval/results artefacts.

vercel · 2026-06-04T13:26:22Z

Project	Deployment	Actions	Updated (UTC)
repi	Ready	Preview, Comment	Jun 4, 2026 5:55pm

Copilot

Pull request overview

This PR completes the “LLM-as-judge” eval harness by splitting the investigation pipeline into evidence gathering vs answer compilation, hardening UTC handling at the DB boundary, adding a leaderboard persistence layer, and updating the UI/SSE stream to surface phases and telemetry.

Changes:

Split investigation into a gather-only ReAct loop plus a separate compile step (repi/investigation/compiler.py) that produces a validated InvestigationAnswer with deterministic fallback.
Harden timestamp handling for TIMESTAMPTZ columns by attaching UTC tzinfo at asyncpg/SQLAlchemy boundaries and aligning ORM schema with timezone-aware columns.
Expand eval harness: judge prompt updates + parse retry, canned judge calibration script, leaderboard table + persistence, and UI phase/stats rendering via SSE.

Reviewed changes

Copilot reviewed 33 out of 35 changed files in this pull request and generated 5 comments.

File	Description
web/lib/sse.ts	Add `phase`/`stats` to SSE hook; add step `kind` typing.
web/components/investigation-step.tsx	Render special step kinds (`reflection`/`signal`/`compile`) with distinct UI.
web/app/investigations/[id]/page.tsx	Show phase indicator (“gathering/compiling”) and display run stats when available.
tests/llm/test_adapters.py	Add tests for typed adapter error classification (429 vs other 4xx) and Mistral retry semantics.
tests/llm/__init__.py	Package marker for llm tests.
tests/investigation/test_react_loop_submit_answer.py	Remove legacy submit-answer finalize tests (loop no longer finalizes).
tests/investigation/test_react_loop_reflection.py	Update reflection tests for `done_gathering` + compile step and separate reflection budget.
tests/investigation/test_react_loop_parser.py	Switch parser tests to shared `repi.llm.json_utils` helpers.
tests/investigation/test_react_loop_ledger.py	Update ledger tests to gather/compile split and `done_gathering`.
tests/investigation/test_react_loop_gathering.py	New tests for gather-only behaviors (done signal, stall detection, null-action guard, stats).
tests/investigation/test_compiler.py	New tests for compiler LLM call + validation retry + deterministic synthesis + floor enforcement.
tests/eval/test_judge.py	Refactor judge tests for shared parser, parse-retry, and advisory precheck.
repi/retrieval/pgvector_store.py	Attach UTC tzinfo before DB writes to TIMESTAMPTZ timestamp columns.
repi/retrieval/filter_builder.py	Attach UTC tzinfo for time window filters at DB boundary.
repi/models/schema.py	Declare `timestamp_{start,end}` as timezone-aware SQLAlchemy `DateTime(timezone=True)`.
repi/llm/json_utils.py	New shared robust JSON extraction/parser for LLM replies.
repi/llm/adapters.py	Add `_check_4xx`, `LLMRateLimitError`, `LLMBadRequestError`; improve Mistral 429 handling.
repi/investigation/tools.py	Ensure ISO timestamps become tz-aware UTC before hitting asyncpg.
repi/investigation/sweep.py	Normalize sweep time window to tz-aware UTC at DB boundary.
repi/investigation/schema.py	Add server-side `enforce_floors` confidence/consistency enforcement.
repi/investigation/react_loop.py	Convert loop to gather-only with `done_gathering`, compile handoff, new stats, SSE phase hooks.
repi/investigation/compiler.py	New compile phase LLM call with validation retry + deterministic synth fallback.
repi/core/dates.py	Add `DateHandler.to_aware_utc()` for asyncpg/SQLAlchemy TIMESTAMPTZ boundary correctness.
repi/api/investigate.py	SSE now emits step `kind`, `phase_change` events, and includes stats in `done` payload for live runs.
eval/run_evals.py	Judge provider auto-selection (avoid self-grading), optional `--out`, leaderboard persistence, richer status.
eval/judge.py	Reason-then-score prompt, shared parser usage, parse retry, deterministic scoring for some criteria.
eval/dataset_3_jwt_key_rotation_noise/seed.py	Anchor seed weekday computation to UTC.
eval/dataset_2_insufficient_logging/seed.py	Anchor seed weekday computation to UTC.
eval/dataset_1_cascading_inventory_migration/seed.py	Anchor seed weekday computation to UTC.
eval/criteria.py	Support building criteria subsets (for deterministic+LLM split scoring).
eval/check_judge.py	New judge hard-test with canned gold/fail answers and thresholds.
db/schema.sql	Add `leaderboard` table + indexes.
CLAUDE.md	Update docs: eval output behavior, judge provider selection, gather/compile split.
bug.json	Remove legacy bug output artifact from repo.
.gitignore	Ignore legacy `bug.json` and optional eval output artifacts.
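The table describes repi/llm/json_utils.py as a shared, robust JSON extractor for LLM replies. One common way to build such a parser, shown here as an assumption rather than the module's actual approach, is to scan for each `{` and let `json.JSONDecoder.raw_decode` attempt a parse from that position, which tolerates prose or code-fence markers around the object. The name `extract_json_object` is hypothetical.

```python
import json
from typing import Optional


def extract_json_object(reply: str) -> Optional[dict]:
    """Return the first parseable JSON object embedded in an LLM reply, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = reply.find("{", start + 1)  # try the next candidate brace
    return None
```

For example, a reply such as `Sure! {"score": 0.9, "explanation": "ok"} Hope that helps` yields the embedded object, while a reply with no braces yields `None` so the caller can trigger its parse retry.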

+            if self.store:
+                await self.store.increment_llm_calls(investigation_obj.id)
+
+            if self.store:
+                await self.store.increment_llm_calls(investigation_obj.id)
+


+    evidence_ids = {c.get("chunk_id") for c in evidence if c.get("chunk_id")}
+    recent = recent_thoughts or []
+    validation_errors: Optional[list[str]] = None
+    last_parsed: dict = {}
+
+    for attempt in range(1, 3):
+        messages = _build_compile_messages(
+            query=query,
+            resolved_intent=resolved_intent,
+            evidence=evidence,
+            tool_ledger=tool_ledger,
+            recent_thoughts=recent,
+            validation_errors=validation_errors,
+            known_services=known_services,
+        )
+        try:
+            raw = await llm.complete(messages, max_tokens=max_tokens, temperature=0.0)
+        except Exception as exc:
+            logger.warning("Compiler LLM call attempt %d raised: %s", attempt, exc)
+            break
+


+        return CompileResult(
+            answer=adjusted,
+            source="llm_invalid",
+            attempts=2,
+            floor_adjustments=adjustments,
+        )


+    return CompileResult(
+        answer=synthesized,
+        source="deterministic",
+        attempts=2,
+        floor_adjustments=[],
+    )


 from sqlmodel import SQLModel, Field, Column, JSON
 from pgvector.sqlalchemy import Vector
-from sqlalchemy import TEXT, ARRAY, Index, String, Column
+from sqlalchemy import TEXT, ARRAY, Index, String, Column, DateTime


- react_loop: collapse the duplicate increment_llm_calls call at the gathering LLM site into a single guarded block, matching the reflection-path pattern. Removes double-counting of total_llm_calls.
- compiler: track attempts actually made and report it on the llm_invalid and deterministic fallback CompileResults (and in the user-visible gaps message). Previously these hard-coded attempts=2 even when the first attempt raised, lying to the leaderboard stats.
- models/schema: drop the unused `Column` re-export from `sqlmodel` so it stops shadowing the SQLAlchemy `Column` that sa_column uses.

VarunGitGood added 3 commits June 4, 2026 18:02

VarunGitGood requested a review from Copilot June 4, 2026 13:34

Copilot started reviewing on behalf of VarunGitGood June 4, 2026 13:35 View session

Copilot AI reviewed Jun 4, 2026

vercel Bot deployed to Preview June 4, 2026 17:55 View deployment

VarunGitGood merged commit 4ce11d9 into main Jun 4, 2026
3 checks passed

VarunGitGood merged 4 commits into main from feat/evaluation-judge
