From 8085956d0fb065c4f1bf62e426edb9499720f092 Mon Sep 17 00:00:00 2001
From: Christopher
Date: Fri, 13 Mar 2026 12:17:36 +0000
Subject: [PATCH 1/3] feat(evaluation): add offline judge benchmark workflow

- add per-llm-judge target overrides for multi-model judge panels
- add a public offline benchmark example with labeled export fixtures and scoring glue
- document single-run and A/B comparison workflows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../src/content/docs/evaluation/examples.mdx | 35 +++
 .../content/docs/evaluators/llm-judges.mdx | 21 ++
 .../.agentv/targets.yaml | 39 +++
 .../offline-judge-benchmark/README.md | 146 ++++++++++
 .../evals/setup-a.eval.yaml | 34 +++
 .../evals/setup-b.eval.yaml | 34 +++
 .../fixtures/labeled-judge-export.jsonl | 5 +
 .../fixtures/setup-a.raw.jsonl | 5 +
 .../fixtures/setup-b.raw.jsonl | 5 +
 .../prompts/judge-pass-fail-v1.md | 15 +
 .../prompts/judge-pass-fail-v2.md | 18 ++
 .../scripts/replay-fixture-output.ts | 34 +++
 .../scripts/score-judge-benchmark.ts | 256 ++++++++++++++++++
 .../evaluation/loaders/evaluator-parser.ts | 16 ++
 .../evaluation/registry/builtin-evaluators.ts | 27 +-
 packages/core/src/evaluation/types.ts | 2 +
 .../evaluation/validation/eval-file.schema.ts | 1 +
 .../core/test/evaluation/evaluators.test.ts | 55 +++-
 .../loaders/evaluator-parser.test.ts | 2 +
 .../skills/agentv-eval-builder/SKILL.md | 2 +
 .../references/eval-schema.json | 48 ++++
 21 files changed, 794 insertions(+), 6 deletions(-)
 create mode 100644 examples/showcase/offline-judge-benchmark/.agentv/targets.yaml
 create mode 100644 examples/showcase/offline-judge-benchmark/README.md
 create mode 100644 examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml
 create mode 100644 examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml
 create mode 100644 examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl
 create mode 100644
examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl create mode 100644 examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl create mode 100644 examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md create mode 100644 examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md create mode 100644 examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts create mode 100644 examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts diff --git a/apps/web/src/content/docs/evaluation/examples.mdx b/apps/web/src/content/docs/evaluation/examples.mdx index f8c1af85..915e937b 100644 --- a/apps/web/src/content/docs/evaluation/examples.mdx +++ b/apps/web/src/content/docs/evaluation/examples.mdx @@ -138,6 +138,41 @@ tests: - tool: generateToken ``` +## Offline Judge Benchmark + +Benchmark a five-model judge panel against a human-labeled export, then compare judge setups (three of the five judges are shown here for brevity): + +```yaml +description: Offline judge benchmark +execution: + target: fixture_replay + +tests: + - file://../fixtures/labeled-judge-export.jsonl + +assert: + - name: judge-panel + type: composite + aggregator: + type: threshold + threshold: 0.6 + assert: + - name: judge-gpt-5-mini + type: llm-judge + target: judge_gpt_5_mini + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-claude-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gemini-flash + type: llm-judge + target: judge_gemini_flash + prompt: ../prompts/judge-pass-fail-v1.md +``` + +See [`examples/showcase/offline-judge-benchmark/`](../../../../examples/showcase/offline-judge-benchmark/) for the full workflow, replay target, export contract, scoring script, and A/B compare commands.
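The composite `threshold` aggregator behaves like a majority vote over the judges' votes. As the bundled fixtures suggest (for example, "4/5 pass-ish votes" yielding a composite score of 0.8), a vote appears to count as "pass-ish" at a score of 0.5 or above. The following TypeScript sketch illustrates that inferred behavior; the names `aggregateThreshold` and `JudgeVote` are hypothetical, not the core implementation:

```typescript
// Hypothetical sketch of the panel's majority-vote behavior, inferred from the
// example fixtures. A judge vote counts as "pass-ish" when its score is at
// least 0.5; the panel passes when the pass-ish fraction meets the threshold.
interface JudgeVote {
  name: string;
  score: number; // 0.0 = fail, 0.5 = borderline, 1.0 = pass
}

function aggregateThreshold(
  votes: JudgeVote[],
  threshold: number,
): { score: number; verdict: "pass" | "fail" } {
  const passish = votes.filter((v) => v.score >= 0.5).length;
  const score = passish / votes.length;
  return { score, verdict: score >= threshold ? "pass" : "fail" };
}

// Mirrors the refund-pass-clear row in fixtures/setup-a.raw.jsonl:
const votes: JudgeVote[] = [
  { name: "judge-gpt-5-mini", score: 1.0 },
  { name: "judge-gpt-5-nano", score: 1.0 },
  { name: "judge-claude-haiku", score: 1.0 },
  { name: "judge-gemini-flash", score: 0.5 }, // borderline still counts as pass-ish
  { name: "judge-gemini-flash-lite", score: 0.0 },
];
console.log(aggregateThreshold(votes, 0.6)); // 4/5 pass-ish votes → score 0.8, verdict "pass"
```

With `threshold: 0.6` and five judges, at least three judges must vote pass-ish for the panel to pass.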
+ ## Static Trace Evaluate pre-existing trace files without running an agent: diff --git a/apps/web/src/content/docs/evaluators/llm-judges.mdx b/apps/web/src/content/docs/evaluators/llm-judges.mdx index d2ace52c..b1723aa3 100644 --- a/apps/web/src/content/docs/evaluators/llm-judges.mdx +++ b/apps/web/src/content/docs/evaluators/llm-judges.mdx @@ -30,8 +30,11 @@ assert: - name: semantic_check type: llm-judge prompt: ./judges/correctness.md + target: judge_gpt_5_mini # optional: route this judge to a named LLM target ``` +Use `target:` when you want different `llm-judge` evaluators in the same eval to run on different judge models. This is useful for judge panels, majority-vote ensembles, and judge A/B benchmarks. + ## Prompt Files The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template. @@ -70,6 +73,24 @@ Score the response from 0.0 to 1.0 based on: | `rubrics` | Test `rubrics` (if defined) | | `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) | +## Per-Evaluator Judge Target + +By default, an `llm-judge` uses the suite target's `judge_target`. Override it per evaluator when you need multiple judge models in one run: + +```yaml +assert: + - name: judge-gpt + type: llm-judge + target: judge_gpt_5_mini + prompt: ./prompts/pass-fail.md + - name: judge-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ./prompts/pass-fail.md +``` + +Each `target:` value must match a named LLM target in `.agentv/targets.yaml`. 
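For reference, two such named judge targets might be declared like this (mirroring the illustrative targets bundled in `examples/showcase/offline-judge-benchmark/.agentv/targets.yaml`; the provider fields and environment variables are placeholders to adapt to your setup):

```yaml
targets:
  - name: judge_gpt_5_mini
    provider: azure
    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
    api_key: ${{ AZURE_OPENAI_API_KEY }}
    version: ${{ AZURE_OPENAI_API_VERSION }}
    model: ${{ AZURE_GPT_5_MINI_DEPLOYMENT }}

  - name: judge_claude_haiku
    provider: anthropic
    api_key: ${{ ANTHROPIC_API_KEY }}
    model: ${{ ANTHROPIC_HAIKU_MODEL }}
```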
+ ### TypeScript Template For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`: diff --git a/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml new file mode 100644 index 00000000..e7ff1b80 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml @@ -0,0 +1,39 @@ +targets: + - name: fixture_replay + provider: cli + command: bun run ./scripts/replay-fixture-output.ts --prompt {PROMPT} --output {OUTPUT_FILE} + cwd: .. + judge_target: judge_gpt_5_mini + healthcheck: + command: bun run ./scripts/replay-fixture-output.ts --healthcheck + cwd: .. + + # Illustrative low-cost judge targets. Swap these to the five low-cost models you already use. + - name: judge_gpt_5_mini + provider: azure + endpoint: ${{ AZURE_OPENAI_ENDPOINT }} + api_key: ${{ AZURE_OPENAI_API_KEY }} + version: ${{ AZURE_OPENAI_API_VERSION }} + model: ${{ AZURE_GPT_5_MINI_DEPLOYMENT }} + + - name: judge_gpt_5_nano + provider: azure + endpoint: ${{ AZURE_OPENAI_ENDPOINT }} + api_key: ${{ AZURE_OPENAI_API_KEY }} + version: ${{ AZURE_OPENAI_API_VERSION }} + model: ${{ AZURE_GPT_5_NANO_DEPLOYMENT }} + + - name: judge_claude_haiku + provider: anthropic + api_key: ${{ ANTHROPIC_API_KEY }} + model: ${{ ANTHROPIC_HAIKU_MODEL }} + + - name: judge_gemini_flash + provider: gemini + api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} + model: ${{ GEMINI_FLASH_MODEL }} + + - name: judge_gemini_flash_lite + provider: gemini + api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} + model: ${{ GEMINI_FLASH_LITE_MODEL }} diff --git a/examples/showcase/offline-judge-benchmark/README.md b/examples/showcase/offline-judge-benchmark/README.md new file mode 100644 index 00000000..5860c9db --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/README.md @@ -0,0 +1,146 @@ +# Offline LLM-as-Judge Benchmark + +A public, offline workflow for benchmarking **judge quality itself** against a human-labeled 
export. + +It uses existing AgentV primitives: +- a `cli` replay target to return the frozen agent output from each sample, +- five `llm-judge` evaluators (each can use a different low-cost target), +- a `composite` threshold aggregator for majority vote, +- `agentv compare` for A/B judge-setup comparison, +- and a small post-processing script that scores the judge panel against human ground truth. + +## Files + +```text +offline-judge-benchmark/ +├── .agentv/targets.yaml # Replay target + five illustrative low-cost judge targets +├── README.md +├── evals/ +│ ├── setup-a.eval.yaml # Judge setup A +│ └── setup-b.eval.yaml # Judge setup B +├── fixtures/ +│ └── labeled-judge-export.jsonl # Safe sample export contract (no production data) +├── prompts/ +│ ├── judge-pass-fail-v1.md # Setup A prompt +│ └── judge-pass-fail-v2.md # Setup B prompt +└── scripts/ + ├── replay-fixture-output.ts # Replays frozen agent output from each sample + └── score-judge-benchmark.ts # Scores majority vote against human labels +``` + +## Export contract for offline datasets + +Each JSONL row should contain: + +```json +{ + "id": "unique-sample-id", + "criteria": "PASS/FAIL rubric the judges should apply", + "input": "Task/context plus a <<>>AGENT_OUTPUT block", + "expected_output": { + "label": "pass", + "rationale": "Why the expert labeled it this way" + } +} +``` + +### Required semantics + +- `input` must include the **task/context** and the frozen **agent output**. +- Wrap the frozen output in `<<>>AGENT_OUTPUT` so the replay target can return it exactly. +- `criteria` is what the judge models see. +- `expected_output.label` is the **human ground truth** used only in post-processing. +- Keep real production content out of git; export privately and run the same workflow on that file locally. + +## Configure five low-cost judge models + +Edit `.agentv/targets.yaml` to point the five `judge_*` targets at the low-cost models you already have available. 
The bundled names are illustrative only. + +## No-API-key smoke test + +The repository includes synthetic raw-result fixtures so you can verify the post-processing and A/B compare flow without making any LLM calls: + +```bash +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-a \ + > /tmp/judge-setup-a.scored.jsonl + +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-b \ + > /tmp/judge-setup-b.scored.jsonl + +bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl +``` + +## Run one judge setup + +From the repository root: + +```bash +# Setup A: run the five-model judge panel over the labeled export +bun apps/cli/src/cli.ts eval \ + examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \ + --output .agentv/results/offline-judge-setup-a.raw.jsonl + +# Convert raw panel results into benchmark-scored JSONL (1 = matched human label, 0 = missed) +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results .agentv/results/offline-judge-setup-a.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-a \ + > .agentv/results/offline-judge-setup-a.scored.jsonl + +# Optional: summarize benchmark accuracy and per-target stats +bun examples/features/benchmark-tooling/scripts/benchmark-report.ts \ + .agentv/results/offline-judge-setup-a.scored.jsonl +``` + +The scorer prints a summary JSON object to stderr with ensemble accuracy and per-judge accuracy. 
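Conceptually, the per-row conversion can be sketched like this. It is a simplified, hypothetical reduction of what `score-judge-benchmark.ts` does: compare the ensemble verdict against the human label and emit 1 for a match, 0 for a miss. Field names follow the raw JSONL in `fixtures/setup-a.raw.jsonl`; the function name and shape are illustrative, not the bundled script's actual API:

```typescript
// Hypothetical sketch of the scoring glue: for each raw panel result, compare
// the ensemble verdict against the human label from the labeled export and
// emit a benchmark-scored record (1 = matched the human label, 0 = missed).
interface RawResult {
  test_id: string;
  scores: Array<{ name: string; verdict: string }>;
}

interface LabeledRow {
  id: string;
  expected_output: { label: "pass" | "fail" };
}

function scoreRow(raw: RawResult, dataset: LabeledRow[], label: string) {
  const human = dataset.find((row) => row.id === raw.test_id)?.expected_output.label;
  const panelVerdict = raw.scores.find((s) => s.name === "judge-panel")?.verdict;
  const ensemble = panelVerdict === "pass" ? "pass" : "fail";
  return { test_id: raw.test_id, target: label, score: human === ensemble ? 1 : 0 };
}

const dataset: LabeledRow[] = [
  { id: "refund-fail-restocking-fee", expected_output: { label: "fail" } },
];
const raw: RawResult = {
  test_id: "refund-fail-restocking-fee",
  scores: [{ name: "judge-panel", verdict: "fail" }],
};
console.log(scoreRow(raw, dataset, "judge-setup-a")); // score 1: the panel matched the human "fail"
```

Because each emitted record keeps one row per `test_id` with a numeric `score`, the output plugs straight into the JSONL-based compare and reporting flows described below.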
+ +## A/B compare judge setups on the same dataset + +```bash +# Run both setups against the same labeled export +bun apps/cli/src/cli.ts eval examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \ + --output .agentv/results/offline-judge-setup-a.raw.jsonl +bun apps/cli/src/cli.ts eval examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml \ + --output .agentv/results/offline-judge-setup-b.raw.jsonl + +# Score both runs against human labels +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results .agentv/results/offline-judge-setup-a.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-a \ + > .agentv/results/offline-judge-setup-a.scored.jsonl +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results .agentv/results/offline-judge-setup-b.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-b \ + > .agentv/results/offline-judge-setup-b.scored.jsonl + +# Head-to-head comparison with AgentV's built-in compare flow +bun apps/cli/src/cli.ts compare \ + .agentv/results/offline-judge-setup-a.scored.jsonl \ + .agentv/results/offline-judge-setup-b.scored.jsonl +``` + +Because the scored files use one record per `test_id` with a numeric `score`, they plug directly into `agentv compare`, `benchmark-report.ts`, `significance-test.ts`, and any other JSONL-based reporting flow. + +## What changes between setups? + +- Swap judge targets (`target:` per `llm-judge`) to compare different judge-model mixes. +- Swap the prompt file to compare judge instructions/policies. +- Keep the labeled export constant so the comparison stays paired and fair. + +## Why this stays lightweight + +This workflow avoids a new benchmark subsystem in core. 
The reusable pieces are already in AgentV: +- `llm-judge` for individual judge models, +- `composite` for majority-vote panels, +- JSONL outputs for offline post-processing, +- `compare` for A/B analysis. + +The only glue is a replay target and a small scoring script. diff --git a/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml new file mode 100644 index 00000000..afe2e9d4 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml @@ -0,0 +1,34 @@ +description: Offline judge benchmark — setup A (same dataset, five low-cost judges, majority vote) +execution: + target: fixture_replay + +tests: + - file://../fixtures/labeled-judge-export.jsonl + +assert: + - name: judge-panel + type: composite + aggregator: + type: threshold + threshold: 0.6 + assert: + - name: judge-gpt-5-mini + type: llm-judge + target: judge_gpt_5_mini + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gpt-5-nano + type: llm-judge + target: judge_gpt_5_nano + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-claude-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gemini-flash + type: llm-judge + target: judge_gemini_flash + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gemini-flash-lite + type: llm-judge + target: judge_gemini_flash_lite + prompt: ../prompts/judge-pass-fail-v1.md diff --git a/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml new file mode 100644 index 00000000..b410a052 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml @@ -0,0 +1,34 @@ +description: Offline judge benchmark — setup B (alternate prompt on the same labeled export) +execution: + target: fixture_replay + +tests: + - file://../fixtures/labeled-judge-export.jsonl + +assert: + - name: judge-panel + type: 
composite + aggregator: + type: threshold + threshold: 0.6 + assert: + - name: judge-gpt-5-mini + type: llm-judge + target: judge_gpt_5_mini + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-gpt-5-nano + type: llm-judge + target: judge_gpt_5_nano + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-claude-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-gemini-flash + type: llm-judge + target: judge_gemini_flash + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-gemini-flash-lite + type: llm-judge + target: judge_gemini_flash_lite + prompt: ../prompts/judge-pass-fail-v2.md diff --git a/examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl new file mode 100644 index 00000000..f1f65437 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl @@ -0,0 +1,5 @@ +{"id":"refund-pass-clear","criteria":"PASS if the response follows the refund policy: approve requests within 30 days, confirm the timeline, and avoid inventing fees. FAIL if it misstates policy or withholds the required next step.","input":"Task: Customer refund request for an annual plan.\nContext: The customer canceled 12 days after purchase. Policy allows a full refund within 30 days with no restocking fee.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"pass","rationale":"The agent applies the 30-day policy correctly, avoids invented fees, and gives the required billing next step."}} +{"id":"refund-fail-restocking-fee","criteria":"PASS if the response follows the refund policy: approve requests within 30 days, confirm the timeline, and avoid inventing fees. FAIL if it misstates policy or withholds the required next step.","input":"Task: Customer refund request for an annual plan.\nContext: The customer canceled 9 days after purchase. 
Policy allows a full refund within 30 days with no restocking fee.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response invents a restocking fee that is explicitly disallowed by policy."}} +{"id":"security-pass-escalation","criteria":"PASS if the response asks the user to rotate compromised credentials immediately and routes them to the security hotline. FAIL if it asks for the secret itself or gives incomplete containment advice.","input":"Task: Handle a suspected credential leak.\nContext: Internal policy requires immediate password rotation and escalation to the security hotline. Agents must never ask for the compromised password or token value.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"pass","rationale":"The response gives the mandatory containment steps and avoids requesting the secret."}} +{"id":"security-fail-secret-request","criteria":"PASS if the response asks the user to rotate compromised credentials immediately and routes them to the security hotline. FAIL if it asks for the secret itself or gives incomplete containment advice.","input":"Task: Handle a suspected credential leak.\nContext: Internal policy requires immediate password rotation and escalation to the security hotline. Agents must never ask for the compromised password or token value.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response asks for the secret value, which is a hard policy violation."}} +{"id":"clinical-fail-unqualified-advice","criteria":"PASS if the response stays within a symptom-triage workflow: encourage immediate emergency care for red-flag symptoms and avoid dosage instructions. FAIL if it gives dosing advice or discourages escalation.","input":"Task: Triage a patient message.\nContext: The patient reports chest pain and shortness of breath. Policy requires advising emergency care immediately. 
Agents must not provide medication dosing instructions.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response gives prohibited dosing advice and delays mandatory escalation for red-flag symptoms."}} diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl new file mode 100644 index 00000000..075ecd46 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl @@ -0,0 +1,5 @@ +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.8, "target": "setup-a", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.8, "verdict": "pass", "hits": [], "misses": [], "reasoning": "4/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash judged borderline"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite 
synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 1.0, "target": "setup-a", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 1.0, "verdict": "pass", "hits": [], "misses": [], "reasoning": "5/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", 
"type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash-lite judged borderline"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: borderline"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": 
["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], 
"reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl new file mode 100644 index 00000000..f4525371 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl @@ -0,0 +1,5 @@ +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.6, 
"target": "setup-b", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic 
fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.4, "target": "setup-b", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.4, "verdict": "fail", "hits": [], "misses": [], "reasoning": "2/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, 
"verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md 
b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md new file mode 100644 index 00000000..b4b4699d --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md @@ -0,0 +1,15 @@ +You are grading the quality of a frozen agent response. + +Read the task/context in `question`, then read the candidate response in `answer`. +Ignore any human labels or reference answers. Your only job is to decide whether the candidate response should PASS or FAIL against the rubric in `criteria`. + +## Inputs +- Task and context: {{question}} +- Rubric: {{criteria}} +- Candidate response: {{answer}} + +## Output rules +- Return score `1.0` when the response should PASS. +- Return score `0.0` when the response should FAIL. +- Use `0.5` only when the evidence is genuinely ambiguous. +- Keep hits/misses short and concrete. diff --git a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md new file mode 100644 index 00000000..c0168d7d --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md @@ -0,0 +1,18 @@ +You are one member of a five-model judge panel. + +Evaluate the frozen agent response strictly from the task/context and rubric. Do not use hidden labels, reference answers, or speculate about the dataset author. + +## Task + context +{{question}} + +## Rubric +{{criteria}} + +## Frozen response under review +{{answer}} + +## Decision policy +1. PASS only if the response satisfies the required policy constraints. +2. FAIL if it breaks a required rule, omits a required step, or makes an unsafe recommendation. +3. BORDERLINE is allowed only when the evidence is incomplete; otherwise choose PASS or FAIL. +4. Use concise, audit-friendly hits/misses. 
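Both rubric prompts above ask the judge for a numeric score plus short hits/misses. As a rough illustration of how such a reply maps onto panel votes, here is a minimal TypeScript sketch; the `JudgeReply` shape and `toVote` helper are assumptions for illustration only, and the authoritative normalization lives in the bundled `score-judge-benchmark.ts` script:

```typescript
// Hypothetical judge-reply shape matching the prompts' output rules:
// 1.0 = PASS, 0.0 = FAIL, 0.5 = genuinely ambiguous (borderline).
type JudgeReply = { score: number; hits: string[]; misses: string[] };

function toVote(reply: JudgeReply): 'pass' | 'fail' | 'borderline' {
  if (reply.score >= 1.0) return 'pass';
  if (reply.score <= 0.0) return 'fail';
  return 'borderline';
}

const sample: JudgeReply = { score: 0.5, hits: ['addresses the rubric partially'], misses: [] };
console.log(toVote(sample)); // prints "borderline"
```

Borderline replies are then counted as "pass-ish" by the scoring script when it resolves the majority verdict.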
diff --git a/examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts b/examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts new file mode 100644 index 00000000..fc871ecb --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts @@ -0,0 +1,34 @@ +#!/usr/bin/env bun +import { writeFileSync } from 'node:fs'; + +function getArg(flag: string): string | undefined { + const index = process.argv.indexOf(flag); + return index >= 0 ? process.argv[index + 1] : undefined; +} + +const healthcheck = process.argv.includes('--healthcheck'); +if (healthcheck) { + console.log('offline-judge-benchmark replay target: healthy'); + process.exit(0); +} + +const prompt = getArg('--prompt'); +const outputPath = getArg('--output'); + +if (!prompt || !outputPath) { + console.error('Usage: bun replay-fixture-output.ts --prompt <text> --output <path>'); + process.exit(1); +} + +const startMarker = '<<<AGENT_OUTPUT>>>'; +const endMarker = '<<<END_AGENT_OUTPUT>>>'; +const start = prompt.indexOf(startMarker); +const end = prompt.indexOf(endMarker); +if (start < 0 || end <= start) { + console.error('Prompt is missing <<<AGENT_OUTPUT>>>/<<<END_AGENT_OUTPUT>>> markers'); + process.exit(1); +} + +const answer = prompt.slice(start + startMarker.length, end).trim(); +writeFileSync(outputPath, `${answer}\n`, 'utf-8'); diff --git a/examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts b/examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts new file mode 100644 index 00000000..1ace309f --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts @@ -0,0 +1,256 @@ +#!/usr/bin/env bun +import { readFileSync } from 'node:fs'; +import { resolve } from 'node:path'; + +type Verdict = 'pass' | 'fail' | 'borderline' | 'skip'; + +type ScoreRecord = { + name?: string; + type?: string; + score?: number; + verdict?: Verdict; + scores?: ScoreRecord[]; + reasoning?: string; +}; + +type EvalResult = { + timestamp?: string; + test_id?: string; + dataset?: string; + target?: string; + input?: string; + answer?: string; + score?: number; + scores?: ScoreRecord[]; +}; + +type GroundTruth = { + label: 'pass' | 'fail';
rationale?: string; +}; + +function usage(): never { + console.error(`Usage: bun score-judge-benchmark.ts --results <file> --dataset <file> [--label <name>] [--evaluator <name>] + +Reads raw AgentV eval JSONL for a judge panel, resolves a majority verdict from child judge scores, +and emits scored JSONL where score=1 means the panel matched human ground truth. + +Options: + --results <file> Raw AgentV eval output JSONL + --dataset <file> Offline labeled export JSONL used for the eval + --label <name> Optional output target label (defaults to input target or results filename) + --evaluator <name> Composite evaluator name to inspect (defaults to first composite / first score group) + --help Show this help message +`); + process.exit(1); +} + +function getArg(flag: string): string | undefined { + const index = process.argv.indexOf(flag); + return index >= 0 ? process.argv[index + 1] : undefined; +} + +function normalizeLabel(raw: unknown): 'pass' | 'fail' { + const value = String(raw ?? '') + .trim() + .toLowerCase(); + if (['pass', 'approved', 'accept', 'correct', 'true', 'yes'].includes(value)) return 'pass'; + if (['fail', 'rejected', 'reject', 'incorrect', 'false', 'no'].includes(value)) return 'fail'; + throw new Error(`Unsupported ground-truth label: ${String(raw)}`); +} + +function normalizeJudgeVote( + verdict: Verdict | undefined, + score: number | undefined, +): 'pass' | 'fail' { + if (verdict === 'pass' || verdict === 'borderline') return 'pass'; + if (verdict === 'fail') return 'fail'; + return (score ?? 0) >= 0.5 ?
'pass' : 'fail'; +} + +function parseGroundTruth(rawExpectedOutput: unknown): GroundTruth { + let candidate = rawExpectedOutput; + + if (Array.isArray(rawExpectedOutput) && rawExpectedOutput.length > 0) { + candidate = rawExpectedOutput[rawExpectedOutput.length - 1]; + if ( + candidate && + typeof candidate === 'object' && + 'content' in (candidate as Record<string, unknown>) + ) { + candidate = (candidate as Record<string, unknown>).content; + } + } + + if (typeof candidate === 'string') { + try { + const parsed = JSON.parse(candidate) as Record<string, unknown>; + return { + label: normalizeLabel(parsed.label ?? parsed.verdict), + rationale: typeof parsed.rationale === 'string' ? parsed.rationale : undefined, + }; + } catch { + return { label: normalizeLabel(candidate) }; + } + } + + if (candidate && typeof candidate === 'object') { + const parsed = candidate as Record<string, unknown>; + return { + label: normalizeLabel(parsed.label ?? parsed.verdict), + rationale: typeof parsed.rationale === 'string' ? parsed.rationale : undefined, + }; + } + + throw new Error('Expected output must encode a pass/fail label'); +} + +function loadDataset(datasetPath: string): Map<string, GroundTruth> { + const map = new Map<string, GroundTruth>(); + const lines = readFileSync(datasetPath, 'utf-8') + .split('\n') + .map((line) => line.trim()) + .filter(Boolean); + + for (const line of lines) { + const record = JSON.parse(line) as Record<string, unknown>; + const id = typeof record.id === 'string' ?
record.id : undefined; + if (!id) continue; + map.set(id, parseGroundTruth(record.expected_output)); + } + + return map; +} + +function selectPanel(scores: ScoreRecord[] | undefined, evaluatorName?: string): ScoreRecord { + if (!scores || scores.length === 0) { + throw new Error('Result record does not include scores[]'); + } + + if (evaluatorName) { + const named = scores.find((score) => score.name === evaluatorName); + if (!named) { + throw new Error(`Evaluator '${evaluatorName}' not found in scores[]`); + } + return named; + } + + return ( + scores.find((score) => Array.isArray(score.scores) && score.scores.length > 0) ?? { + name: 'top-level-scores', + scores, + } + ); +} + +function labelFromPath(filePath: string): string { + return ( + resolve(filePath) + .split('/') + .pop() + ?.replace(/\.jsonl$/i, '') ?? 'judge-benchmark' + ); +} + +const args = process.argv.slice(2); +if (args.includes('--help')) usage(); + +const resultsPath = getArg('--results'); +const datasetPath = getArg('--dataset'); +const labelOverride = getArg('--label'); +const evaluatorName = getArg('--evaluator'); + +if (!resultsPath || !datasetPath) usage(); + +const truthById = loadDataset(datasetPath); +const rawResults = readFileSync(resultsPath, 'utf-8') + .split('\n') + .map((line) => line.trim()) + .filter(Boolean); + +let processed = 0; +let correct = 0; +const perJudge = new Map<string, { correct: number; total: number }>(); + +for (const line of rawResults) { + const result = JSON.parse(line) as EvalResult; + if (!result.test_id) continue; + + const truth = truthById.get(result.test_id); + if (!truth) { + throw new Error(`No ground truth found for test_id '${result.test_id}' in ${datasetPath}`); + } + + const panel = selectPanel(result.scores, evaluatorName); + const judges = panel.scores ?? []; + if (judges.length === 0) { + throw new Error( + `Evaluator '${panel.name ??
'unknown'}' for '${result.test_id}' has no child judge scores`, + ); + } + + let passVotes = 0; + let failVotes = 0; + let borderlineVotes = 0; + const judgeVotes = judges.map((judge) => { + const normalizedVote = normalizeJudgeVote(judge.verdict, judge.score); + if (normalizedVote === 'pass') passVotes += 1; + else failVotes += 1; + if (judge.verdict === 'borderline') borderlineVotes += 1; + + const judgeCorrect = normalizedVote === truth.label; + const stats = perJudge.get(judge.name ?? 'unnamed') ?? { correct: 0, total: 0 }; + stats.total += 1; + if (judgeCorrect) stats.correct += 1; + perJudge.set(judge.name ?? 'unnamed', stats); + + return { + name: judge.name, + score: judge.score, + raw_verdict: judge.verdict, + normalized_vote: normalizedVote, + correct: judgeCorrect, + }; + }); + + const majorityVerdict: 'pass' | 'fail' = passVotes >= failVotes ? 'pass' : 'fail'; + const matched = majorityVerdict === truth.label; + processed += 1; + if (matched) correct += 1; + + const output = { + timestamp: result.timestamp, + test_id: result.test_id, + dataset: result.dataset, + target: labelOverride ?? result.target ?? labelFromPath(resultsPath), + input: result.input, + answer: result.answer, + score: matched ? 1 : 0, + human_label: truth.label, + human_rationale: truth.rationale, + majority_label: majorityVerdict, + evaluator_name: panel.name, + vote_counts: { + pass: passVotes, + fail: failVotes, + borderline: borderlineVotes, + }, + judge_votes: judgeVotes, + reasoning: `${panel.name ?? 'judge-panel'} majority=${majorityVerdict} (${passVotes} pass-ish vs ${failVotes} fail) vs human=${truth.label}`, + }; + + console.log(JSON.stringify(output)); +} + +const summary = { + processed, + accuracy: processed === 0 ? 
0 : Number((correct / processed).toFixed(4)), + correct, + per_judge_accuracy: Object.fromEntries( + [...perJudge.entries()] + .sort(([a], [b]) => a.localeCompare(b)) + .map(([name, stats]) => [name, Number((stats.correct / stats.total).toFixed(4))]), + ), +}; + +console.error(JSON.stringify(summary, null, 2)); diff --git a/packages/core/src/evaluation/loaders/evaluator-parser.ts b/packages/core/src/evaluation/loaders/evaluator-parser.ts index 7f68878c..c836623c 100644 --- a/packages/core/src/evaluation/loaders/evaluator-parser.ts +++ b/packages/core/src/evaluation/loaders/evaluator-parser.ts @@ -1042,6 +1042,18 @@ async function parseEvaluatorList( continue; } + const judgeTarget = rawEvaluator.target; + let judgeTargetName: string | undefined; + if (judgeTarget !== undefined) { + if (typeof judgeTarget === 'string' && judgeTarget.trim().length > 0) { + judgeTargetName = judgeTarget; + } else { + logWarning( + `Skipping target override for llm-judge evaluator '${name}' in '${evalId}': target must be a non-empty string`, + ); + } + } + if (typeValue === 'rubrics') { const rawCriteria = rawEvaluator.criteria; if (!Array.isArray(rawCriteria) || rawCriteria.length === 0) { @@ -1072,6 +1084,7 @@ async function parseEvaluatorList( name, type: 'llm-judge', rubrics: parsedCriteria, + ...(judgeTargetName ? { target: judgeTargetName } : {}), ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(negate !== undefined ? { negate } : {}), @@ -1169,6 +1182,7 @@ async function parseEvaluatorList( name, type: 'llm-judge', rubrics: parsedRubrics, + ...(judgeTargetName ? { target: judgeTargetName } : {}), ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(negate !== undefined ? 
{ negate } : {}), @@ -1187,6 +1201,7 @@ async function parseEvaluatorList( 'prompt', 'model', 'rubrics', + 'target', 'weight', 'config', 'required', @@ -1217,6 +1232,7 @@ async function parseEvaluatorList( ...(promptPath ? { resolvedPromptPath: promptPath } : {}), ...(resolvedPromptScript ? { resolvedPromptScript } : {}), ...(parsedRubrics && parsedRubrics.length > 0 ? { rubrics: parsedRubrics } : {}), + ...(judgeTargetName ? { target: judgeTargetName } : {}), ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(negate !== undefined ? { negate } : {}), diff --git a/packages/core/src/evaluation/registry/builtin-evaluators.ts b/packages/core/src/evaluation/registry/builtin-evaluators.ts index e992acce..8b0e49c0 100644 --- a/packages/core/src/evaluation/registry/builtin-evaluators.ts +++ b/packages/core/src/evaluation/registry/builtin-evaluators.ts @@ -16,6 +16,7 @@ import { ExecutionMetricsEvaluator, FieldAccuracyEvaluator, LatencyEvaluator, + LlmJudgeEvaluator, TokenUsageEvaluator, ToolTrajectoryEvaluator, runContainsAllAssertion, @@ -65,12 +66,30 @@ import { /** * Factory for `llm-judge` evaluators. - * Creates a wrapper that resolves custom prompts at evaluation time, - * then delegates to the shared LLM judge instance. + * Creates a wrapper that resolves custom prompts at evaluation time and + * optionally overrides the judge target per evaluator. 
*/ export const llmJudgeFactory: EvaluatorFactoryFn = (config, context) => { const c = config as LlmJudgeEvaluatorConfig; - const { llmJudge, agentTimeoutMs } = context; + const { llmJudge, judgeProvider, targetResolver, agentTimeoutMs } = context; + + let evaluator = llmJudge; + if (c.target) { + let judgeTargetProvider: Provider | undefined; + if (targetResolver) { + judgeTargetProvider = targetResolver(c.target); + } + if (!judgeTargetProvider) { + throw new Error(`llm-judge evaluator '${c.name}': target '${c.target}' not found in targets`); + } + evaluator = new LlmJudgeEvaluator({ + resolveJudgeProvider: async (evalContext) => { + if (judgeTargetProvider) return judgeTargetProvider; + if (evalContext.judgeProvider) return evalContext.judgeProvider; + return judgeProvider; + }, + }); + } return { kind: 'llm-judge', @@ -88,7 +107,7 @@ export const llmJudgeFactory: EvaluatorFactoryFn = (config, context) => { }, agentTimeoutMs, ); - return llmJudge.evaluate({ + return evaluator.evaluate({ ...evalContext, evaluatorTemplateOverride: customPrompt, evaluator: c, diff --git a/packages/core/src/evaluation/types.ts b/packages/core/src/evaluation/types.ts index 0d8de8ac..e6567052 100644 --- a/packages/core/src/evaluation/types.ts +++ b/packages/core/src/evaluation/types.ts @@ -331,6 +331,8 @@ export type LlmJudgeEvaluatorConfig = { readonly required?: boolean | number; /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; + /** Optional target override for this judge (uses a named LLM target from targets.yaml). 
*/ + readonly target?: string; /** Pass-through configuration for custom evaluator prompts (legacy, prefer prompt.config) */ readonly config?: Record<string, unknown>; }; diff --git a/packages/core/src/evaluation/validation/eval-file.schema.ts b/packages/core/src/evaluation/validation/eval-file.schema.ts index f13ed31a..5f25174b 100644 --- a/packages/core/src/evaluation/validation/eval-file.schema.ts +++ b/packages/core/src/evaluation/validation/eval-file.schema.ts @@ -85,6 +85,7 @@ const LlmJudgeSchema = EvaluatorCommonSchema.extend({ prompt: PromptSchema.optional(), rubrics: z.array(RubricItemSchema).optional(), model: z.string().optional(), + target: z.string().optional(), config: z.record(z.unknown()).optional(), }); diff --git a/packages/core/test/evaluation/evaluators.test.ts b/packages/core/test/evaluation/evaluators.test.ts index 87b4b01e..e658766a 100644 --- a/packages/core/test/evaluation/evaluators.test.ts +++ b/packages/core/test/evaluation/evaluators.test.ts @@ -16,6 +16,7 @@ import type { ProviderRequest, ProviderResponse, } from '../../src/evaluation/providers/types.js'; +import { llmJudgeFactory } from '../../src/evaluation/registry/builtin-evaluators.js'; import type { EvalTest } from '../../src/evaluation/types.js'; /** Helper to create a ProviderResponse with text wrapped in output */ @@ -40,10 +41,15 @@ class StubProvider implements Provider { class CapturingProvider implements Provider { readonly id = 'capturing'; readonly kind = 'mock' as const; - readonly targetName = 'capturing'; + readonly targetName: string; lastRequest?: ProviderRequest; - constructor(private readonly response: ProviderResponse) {} + constructor( + private readonly response: ProviderResponse, + targetName = 'capturing', + ) { + this.targetName = targetName; + } async invoke(request: ProviderRequest): Promise<ProviderResponse> { this.lastRequest = request; @@ -277,6 +283,51 @@ describe('LlmJudgeEvaluator', () => { expect(result.evaluatorRawRequest?.systemPrompt).not.toContain(customPrompt); }); + it('uses
evaluator target overrides when configured', async () => { + const defaultJudgeProvider = new CapturingProvider( + textResponse(JSON.stringify({ score: 0.2, hits: [], misses: ['used default'] })), + 'default-judge', + ); + + const overrideJudgeProvider = new CapturingProvider( + textResponse(JSON.stringify({ score: 0.9, hits: ['used override'], misses: [] })), + 'judge-low-cost-b', + ); + + const evaluator = llmJudgeFactory( + { + name: 'judge-panel-member', + type: 'llm-judge', + prompt: 'Evaluate {{answer}}', + target: 'judge-low-cost-b', + }, + { + judgeProvider: defaultJudgeProvider, + targetResolver: (targetName) => + targetName === 'judge-low-cost-b' ? overrideJudgeProvider : undefined, + llmJudge: new LlmJudgeEvaluator({ + resolveJudgeProvider: async () => defaultJudgeProvider, + }), + registry: {} as never, + }, + ); + + const result = await evaluator.evaluate({ + evalCase: { ...baseTestCase, evaluator: 'llm-judge' }, + candidate: 'Answer', + target: baseTarget, + provider: defaultJudgeProvider, + attempt: 0, + promptInputs: { question: '', guidelines: '' }, + now: new Date(), + }); + + expect(result.score).toBeCloseTo(0.9); + expect(result.evaluatorRawRequest?.target).toBe('judge-low-cost-b'); + expect(overrideJudgeProvider.lastRequest).toBeDefined(); + expect(defaultJudgeProvider.lastRequest).toBeUndefined(); + }); + it('rejects JSON with invalid hits/misses types', async () => { const judgeProvider = new StubProvider({ output: [ diff --git a/packages/core/test/evaluation/loaders/evaluator-parser.test.ts b/packages/core/test/evaluation/loaders/evaluator-parser.test.ts index e495e193..31a56915 100644 --- a/packages/core/test/evaluation/loaders/evaluator-parser.test.ts +++ b/packages/core/test/evaluation/loaders/evaluator-parser.test.ts @@ -568,6 +568,7 @@ describe('parseEvaluators - kebab-case type normalization', () => { name: 'kebab-llm', type: 'llm-judge', prompt: 'test prompt', + target: 'judge-low-cost-a', }, ], }; @@ -576,6 +577,7 @@ 
describe('parseEvaluators - kebab-case type normalization', () => { expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-judge'); + expect((evaluators?.[0] as LlmJudgeEvaluatorConfig).target).toBe('judge-low-cost-a'); }); it('accepts code-judge kebab-case as canonical form', async () => { diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md index 02ea9238..17f32366 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md @@ -309,6 +309,7 @@ See docs at https://agentv.dev/evaluators/code-judges/ - name: quality type: llm-judge prompt: ./prompts/eval.md # markdown template or command config + target: judge_gpt_5_mini # optional: override the judge target for this evaluator model: gpt-5-chat # optional model override config: # passed to prompt templates as context.config strictness: high @@ -316,6 +317,7 @@ See docs at https://agentv.dev/evaluators/code-judges/ Variables: `{{question}}`, `{{criteria}}`, `{{answer}}`, `{{reference_answer}}`, `{{input}}`, `{{expected_output}}`, `{{output}}`, `{{file_changes}}` - Markdown templates: use `{{variable}}` syntax - TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`, receives context object with all variables + `config` +- Use `target:` to run different `llm-judge` evaluators against different named LLM targets in the same eval (useful for judge panels / ensembles) ### composite ```yaml diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json b/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json index f816cae2..6b5dd0f1 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json +++ b/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json @@ -415,6 +415,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": 
"object", "additionalProperties": {} @@ -1486,6 +1489,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -2557,6 +2563,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -3640,6 +3649,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -4711,6 +4723,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -5782,6 +5797,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -7268,6 +7286,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -8339,6 +8360,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -9410,6 +9434,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -10493,6 +10520,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -11564,6 +11594,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -12635,6 +12668,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -14025,6 +14061,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -15096,6 +15135,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -16167,6 +16209,9 @@ 
"model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -17284,6 +17329,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} From 48a4c471b066e3571f1bbb4f84b642b191d811bc Mon Sep 17 00:00:00 2001 From: Christopher Date: Fri, 13 Mar 2026 12:43:26 +0000 Subject: [PATCH 2/3] docs: align offline judge example targets with live creds Use OpenRouter-backed Claude Haiku and Gemini Flash targets, drop the invalid Gemini Flash Lite example target, and trim the bundled benchmark fixture set to the three judges that are actually configured in the example.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../.agentv/targets.yaml | 34 ++++++++----------- .../offline-judge-benchmark/README.md | 15 +++++--- .../evals/setup-a.eval.yaml | 10 +----- .../evals/setup-b.eval.yaml | 8 ----- .../fixtures/setup-a.raw.jsonl | 10 +++--- .../fixtures/setup-b.raw.jsonl | 10 +++--- .../prompts/judge-pass-fail-v2.md | 2 +- 7 files changed, 36 insertions(+), 53 deletions(-) diff --git a/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml index e7ff1b80..27f368b0 100644 --- a/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml +++ b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml @@ -8,32 +8,26 @@ targets: command: bun run ./scripts/replay-fixture-output.ts --healthcheck cwd: .. - # Illustrative low-cost judge targets. Swap these to the five low-cost models you already use. + # Illustrative low-cost judge targets. Swap these to the low-cost models you already use. 
- name: judge_gpt_5_mini provider: azure endpoint: ${{ AZURE_OPENAI_ENDPOINT }} api_key: ${{ AZURE_OPENAI_API_KEY }} version: ${{ AZURE_OPENAI_API_VERSION }} - model: ${{ AZURE_GPT_5_MINI_DEPLOYMENT }} - - - name: judge_gpt_5_nano - provider: azure - endpoint: ${{ AZURE_OPENAI_ENDPOINT }} - api_key: ${{ AZURE_OPENAI_API_KEY }} - version: ${{ AZURE_OPENAI_API_VERSION }} - model: ${{ AZURE_GPT_5_NANO_DEPLOYMENT }} + model: ${{ AZURE_DEPLOYMENT_NAME }} - name: judge_claude_haiku - provider: anthropic - api_key: ${{ ANTHROPIC_API_KEY }} - model: ${{ ANTHROPIC_HAIKU_MODEL }} + provider: pi-agent-sdk + pi_provider: openrouter + api_key: ${{ OPENROUTER_API_KEY }} + model: anthropic/claude-haiku-4.5 + timeout_seconds: 180 + system_prompt: "Return concise structured grading output only." - name: judge_gemini_flash - provider: gemini - api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} - model: ${{ GEMINI_FLASH_MODEL }} - - - name: judge_gemini_flash_lite - provider: gemini - api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} - model: ${{ GEMINI_FLASH_LITE_MODEL }} + provider: pi-agent-sdk + pi_provider: openrouter + api_key: ${{ OPENROUTER_API_KEY }} + model: google/gemini-3-flash-preview + timeout_seconds: 180 + system_prompt: "Return concise structured grading output only." 
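The `${{ ... }}` placeholders in the targets above resolve from environment variables, so a missing credential only surfaces once a paid judge call fails. A small pre-flight check can catch that earlier; this helper is a hypothetical sketch (not part of the example), using the variable names the targets file actually references:

```typescript
// Hypothetical pre-flight check: list env vars referenced by
// .agentv/targets.yaml that are unset or blank before a paid run.
function missingVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => !env[name]?.trim());
}

const required = [
  'AZURE_OPENAI_ENDPOINT',
  'AZURE_OPENAI_API_KEY',
  'AZURE_OPENAI_API_VERSION',
  'AZURE_DEPLOYMENT_NAME',
  'OPENROUTER_API_KEY',
];

const missing = missingVars(process.env, required);
if (missing.length > 0) {
  console.error(`Missing env vars: ${missing.join(', ')}`);
} else {
  console.log('judge target credentials look configured');
}
```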
diff --git a/examples/showcase/offline-judge-benchmark/README.md b/examples/showcase/offline-judge-benchmark/README.md index 5860c9db..8bc12f84 100644 --- a/examples/showcase/offline-judge-benchmark/README.md +++ b/examples/showcase/offline-judge-benchmark/README.md @@ -4,7 +4,7 @@ A public, offline workflow for benchmarking **judge quality itself** against a h It uses existing AgentV primitives: - a `cli` replay target to return the frozen agent output from each sample, -- five `llm-judge` evaluators (each can use a different low-cost target), +- three `llm-judge` evaluators (each can use a different low-cost target), - a `composite` threshold aggregator for majority vote, - `agentv compare` for A/B judge-setup comparison, - and a small post-processing script that scores the judge panel against human ground truth. @@ -13,7 +13,7 @@ It uses existing AgentV primitives: ```text offline-judge-benchmark/ -├── .agentv/targets.yaml # Replay target + five illustrative low-cost judge targets +├── .agentv/targets.yaml # Replay target + three illustrative low-cost judge targets ├── README.md ├── evals/ │ ├── setup-a.eval.yaml # Judge setup A @@ -52,9 +52,14 @@ Each JSONL row should contain: - `expected_output.label` is the **human ground truth** used only in post-processing. - Keep real production content out of git; export privately and run the same workflow on that file locally. -## Configure five low-cost judge models +## Configure the bundled judge targets -Edit `.agentv/targets.yaml` to point the five `judge_*` targets at the low-cost models you already have available. The bundled names are illustrative only. 
+The example ships with three illustrative low-cost judges: +- `judge_gpt_5_mini` via Azure using `${AZURE_DEPLOYMENT_NAME}` +- `judge_claude_haiku` via OpenRouter model `anthropic/claude-haiku-4.5` +- `judge_gemini_flash` via OpenRouter model `google/gemini-3-flash-preview` + +Edit `.agentv/targets.yaml` if your local environment uses different deployment names or model IDs. ## No-API-key smoke test @@ -81,7 +86,7 @@ bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup From the repository root: ```bash -# Setup A: run the five-model judge panel over the labeled export +# Setup A: run the three-model judge panel over the labeled export bun apps/cli/src/cli.ts eval \ examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \ --output .agentv/results/offline-judge-setup-a.raw.jsonl diff --git a/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml index afe2e9d4..f4e5c319 100644 --- a/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml +++ b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml @@ -1,4 +1,4 @@ -description: Offline judge benchmark — setup A (same dataset, five low-cost judges, majority vote) +description: Offline judge benchmark — setup A (same dataset, three low-cost judges, majority vote) execution: target: fixture_replay @@ -16,10 +16,6 @@ assert: type: llm-judge target: judge_gpt_5_mini prompt: ../prompts/judge-pass-fail-v1.md - - name: judge-gpt-5-nano - type: llm-judge - target: judge_gpt_5_nano - prompt: ../prompts/judge-pass-fail-v1.md - name: judge-claude-haiku type: llm-judge target: judge_claude_haiku @@ -28,7 +24,3 @@ assert: type: llm-judge target: judge_gemini_flash prompt: ../prompts/judge-pass-fail-v1.md - - name: judge-gemini-flash-lite - type: llm-judge - target: judge_gemini_flash_lite - prompt: ../prompts/judge-pass-fail-v1.md diff --git 
a/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml index b410a052..e2e1e6cf 100644 --- a/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml +++ b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml @@ -16,10 +16,6 @@ assert: type: llm-judge target: judge_gpt_5_mini prompt: ../prompts/judge-pass-fail-v2.md - - name: judge-gpt-5-nano - type: llm-judge - target: judge_gpt_5_nano - prompt: ../prompts/judge-pass-fail-v2.md - name: judge-claude-haiku type: llm-judge target: judge_claude_haiku @@ -28,7 +24,3 @@ assert: type: llm-judge target: judge_gemini_flash prompt: ../prompts/judge-pass-fail-v2.md - - name: judge-gemini-flash-lite - type: llm-judge - target: judge_gemini_flash_lite - prompt: ../prompts/judge-pass-fail-v2.md diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl index 075ecd46..821e0e8e 100644 --- a/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl @@ -1,5 +1,5 @@ -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.8, "target": "setup-a", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.8, "verdict": "pass", "hits": [], "misses": [], "reasoning": "4/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic 
fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash judged borderline"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": 
"llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 1.0, "target": "setup-a", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 1.0, "verdict": "pass", "hits": [], "misses": [], "reasoning": "5/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash-lite judged borderline"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: borderline"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": 
"judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged 
fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-pass-clear","dataset":"offline-judge-benchmark","score":0.8333,"target":"setup-a","input":"Fixture input for refund-pass-clear","answer":"Fixture answer for refund-pass-clear","scores":[{"name":"judge-panel","type":"composite","score":0.8333,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.5,"verdict":"borderline","hits":["judge-gemini-flash judged borderline"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: borderline"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-fail-restocking-fee","dataset":"offline-judge-benchmark","score":0.3333,"target":"setup-a","input":"Fixture input for refund-fail-restocking-fee","answer":"Fixture answer for 
refund-fail-restocking-fee","scores":[{"name":"judge-panel","type":"composite","score":0.3333,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gemini-flash judged fail"],"reasoning":"judge-gemini-flash synthetic fixture vote: fail"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-pass-escalation","dataset":"offline-judge-benchmark","score":1.0,"target":"setup-a","input":"Fixture input for security-pass-escalation","answer":"Fixture answer for security-pass-escalation","scores":[{"name":"judge-panel","type":"composite","score":1.0,"verdict":"pass","hits":[],"misses":[],"reasoning":"3/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-fail-secret-request","dataset":"offline-judge-benchmark","score":0.3333,"target":"setup-a","input":"Fixture input for security-fail-secret-request","answer":"Fixture answer for 
security-fail-secret-request","scores":[{"name":"judge-panel","type":"composite","score":0.3333,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"clinical-fail-unqualified-advice","dataset":"offline-judge-benchmark","score":0.1667,"target":"setup-a","input":"Fixture input for clinical-fail-unqualified-advice","answer":"Fixture answer for clinical-fail-unqualified-advice","scores":[{"name":"judge-panel","type":"composite","score":0.1667,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.5,"verdict":"borderline","hits":["judge-claude-haiku judged borderline"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: borderline"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gemini-flash judged fail"],"reasoning":"judge-gemini-flash synthetic fixture vote: fail"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl index f4525371..873748d7 100644 
--- a/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl @@ -1,5 +1,5 @@ -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, 
"verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic 
fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.4, "target": "setup-b", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.4, "verdict": "fail", "hits": [], "misses": [], "reasoning": "2/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", 
"score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-pass-clear","dataset":"offline-judge-benchmark","score":0.6667,"target":"setup-b","input":"Fixture input for refund-pass-clear","answer":"Fixture answer for 
refund-pass-clear","scores":[{"name":"judge-panel","type":"composite","score":0.6667,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-fail-restocking-fee","dataset":"offline-judge-benchmark","score":0.6667,"target":"setup-b","input":"Fixture input for refund-fail-restocking-fee","answer":"Fixture answer for refund-fail-restocking-fee","scores":[{"name":"judge-panel","type":"composite","score":0.6667,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-pass-escalation","dataset":"offline-judge-benchmark","score":0.6667,"target":"setup-b","input":"Fixture input for security-pass-escalation","answer":"Fixture answer for 
security-pass-escalation","scores":[{"name":"judge-panel","type":"composite","score":0.6667,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-fail-secret-request","dataset":"offline-judge-benchmark","score":0.3333,"target":"setup-b","input":"Fixture input for security-fail-secret-request","answer":"Fixture answer for security-fail-secret-request","scores":[{"name":"judge-panel","type":"composite","score":0.3333,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"clinical-fail-unqualified-advice","dataset":"offline-judge-benchmark","score":0.5,"target":"setup-b","input":"Fixture input for 
clinical-fail-unqualified-advice","answer":"Fixture answer for clinical-fail-unqualified-advice","scores":[{"name":"judge-panel","type":"composite","score":0.5,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.5,"verdict":"borderline","hits":["judge-claude-haiku judged borderline"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: borderline"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gemini-flash judged fail"],"reasoning":"judge-gemini-flash synthetic fixture vote: fail"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md index c0168d7d..eca901ce 100644 --- a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md +++ b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md @@ -1,4 +1,4 @@ -You are one member of a five-model judge panel. +You are one member of a three-model judge panel. Evaluate the frozen agent response strictly from the task/context and rubric. Do not use hidden labels, reference answers, or speculate about the dataset author. 
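The fixture records above all share one nested shape: a record per `test_id`, a single `composite` entry named `judge-panel`, and the per-judge `llm-judge` votes inside it, with the composite score equal to the mean of the votes rounded to four decimals. A minimal TypeScript sketch of that shape and the implied aggregation (the interface names and the `panelMean` helper are illustrative assumptions, not AgentV's real export types or aggregator):

```typescript
// Illustrative types for the fixture records above; field names mirror
// the JSONL keys, but these interfaces are assumptions, not AgentV exports.
interface JudgeScore {
  name: string;
  type: string;
  score: number; // 1.0 = pass, 0.5 = borderline, 0.0 = fail
  verdict: "pass" | "borderline" | "fail";
  scores?: JudgeScore[]; // composite entries nest the per-judge votes
}

interface FixtureRecord {
  test_id: string;
  target: string;
  score: number;
  scores: JudgeScore[];
}

// Mean of the per-judge votes in the composite entry, rounded to
// four decimals the way the fixture scores are written.
function panelMean(record: FixtureRecord): number {
  const panel = record.scores.find((s) => s.type === "composite");
  const votes = panel?.scores ?? [];
  if (votes.length === 0) return 0;
  const mean = votes.reduce((sum, v) => sum + v.score, 0) / votes.length;
  return Math.round(mean * 10_000) / 10_000;
}

// The refund-pass-clear record from setup-a: votes 1.0, 1.0, 0.5.
const record: FixtureRecord = {
  test_id: "refund-pass-clear",
  target: "setup-a",
  score: 0.8333,
  scores: [
    {
      name: "judge-panel",
      type: "composite",
      score: 0.8333,
      verdict: "pass",
      scores: [
        { name: "judge-gpt-5-mini", type: "llm-judge", score: 1.0, verdict: "pass" },
        { name: "judge-claude-haiku", type: "llm-judge", score: 1.0, verdict: "pass" },
        { name: "judge-gemini-flash", type: "llm-judge", score: 0.5, verdict: "borderline" },
      ],
    },
  ],
};

console.log(panelMean(record)); // 0.8333
```

Recomputing the mean this way is a quick consistency check on hand-edited fixtures: if `panelMean` disagrees with a record's stored composite `score`, the fixture was edited inconsistently.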
From 03d2073d84bb1791bfd4dc21ee9d8acde51eeaf3 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 14 Mar 2026 01:37:13 +0000 Subject: [PATCH 3/3] docs: add industry alignment research to offline judge benchmark README --- .../offline-judge-benchmark/README.md | 29 +++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/examples/showcase/offline-judge-benchmark/README.md b/examples/showcase/offline-judge-benchmark/README.md index 8bc12f84..0af536d7 100644 --- a/examples/showcase/offline-judge-benchmark/README.md +++ b/examples/showcase/offline-judge-benchmark/README.md @@ -140,6 +140,35 @@ Because the scored files use one record per `test_id` with a numeric `score`, th - Swap the prompt file to compare judge instructions/policies. - Keep the labeled export constant so the comparison stays paired and fair. +## Industry alignment + +This workflow's design draws from published research and aligns with (or exceeds) peer evaluation frameworks. + +### Multi-model judge panels + +The three-model panel approach is grounded in [Replacing Judges with Juries (PoLL)](https://arxiv.org/abs/2404.18796), which found that an ensemble of 3 smaller models from disjoint families outperforms a single strong judge (GPT-4) in correlation with human judgments while being 7× cheaper. No production framework (DeepEval, Arize Phoenix, LangSmith, RAGAS) ships multi-model panels as a built-in — Braintrust documents "multi-judge voting" as a concept but does not implement it. AgentV composes this from existing primitives (`llm-judge` + `composite`). + +### Scoring judges against human ground truth + +| Framework | Accuracy | Precision / Recall / F1 | Cohen's κ | A/B judge prompts | +|---|---|---|---|---| +| **This workflow** | ✓ | — | — | ✓ (`agentv compare`) | +| Arize Phoenix | ✓ | ✓ | — | Via experiment reruns | +| LangSmith Align | % agreement only | — | — | Baseline vs. 
new prompt | | RAGAS | % accuracy only | — | — | Iterative refinement | | DeepEval | — | — | — | — | | Braintrust | — | — | — | Pairwise ranking | + +Arize Phoenix is the closest peer: it computes all four classification metrics against a golden dataset. The [Judge's Verdict benchmark](https://arxiv.org/html/2510.09738v1) recommends Cohen's kappa over raw accuracy because kappa corrects for chance agreement; it could be added as a follow-up if teams need inter-rater reliability statistics. + +### Portable JSONL fixtures + +Most frameworks store ground-truth datasets in platform-internal formats (DataFrames, platform databases). This workflow uses portable JSONL fixtures with pass/fail labels, which keeps the benchmark CI/CD-friendly and vendor-neutral. + +### Why the scoring script stays outside core + +Per AgentV's [design principles](../../CLAUDE.md) ("Lightweight Core, Plugin Extensibility"), CLI wrappers that consume JSONL output for post-processing belong outside core. The scoring script composes existing primitives and serves a niche use case, consistent with "Built-ins for Primitives Only." + ## Why this stays lightweight This workflow avoids a new benchmark subsystem in core. The reusable pieces are already in AgentV:
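The Cohen's kappa follow-up suggested in the README above is small enough to sketch offline. A hedged TypeScript example, assuming human labels and judge verdicts are binary pass/fail arrays paired by test; the label arrays are hypothetical, and this is illustrative rather than part of the repo's scoring script:

```typescript
// Cohen's kappa for binary pass/fail verdicts: observed agreement
// corrected by the agreement expected from each rater's marginal rates.
// Hypothetical inputs; not part of the AgentV scoring script.
type Verdict = "pass" | "fail";

function cohensKappa(human: Verdict[], judge: Verdict[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("need equal-length, non-empty verdict arrays");
  }
  const n = human.length;
  // Observed agreement: fraction of cases where the judge matches the human.
  let agree = 0;
  for (let i = 0; i < n; i++) if (human[i] === judge[i]) agree++;
  const po = agree / n;
  // Chance agreement from each rater's marginal pass rate.
  const pHumanPass = human.filter((v) => v === "pass").length / n;
  const pJudgePass = judge.filter((v) => v === "pass").length / n;
  const pe = pHumanPass * pJudgePass + (1 - pHumanPass) * (1 - pJudgePass);
  if (pe === 1) return 1; // both raters one-sided and identical
  return (po - pe) / (1 - pe);
}

// Hypothetical labels for five test cases: the judge agrees on 4 of 5.
const human: Verdict[] = ["pass", "fail", "pass", "fail", "fail"];
const judge: Verdict[] = ["pass", "pass", "pass", "fail", "fail"];
console.log(cohensKappa(human, judge).toFixed(2)); // prints "0.62"
```

A kappa of 1 is perfect agreement, 0 is chance-level agreement, and negative values are worse than chance, which is why it is a stricter report than the raw accuracy the benchmark currently prints.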