From 8085956d0fb065c4f1bf62e426edb9499720f092 Mon Sep 17 00:00:00 2001
From: Christopher
Date: Fri, 13 Mar 2026 12:17:36 +0000
Subject: [PATCH 1/3] feat(evaluation): add offline judge benchmark workflow

- add per-llm-judge target overrides for multi-model judge panels
- add a public offline benchmark example with labeled export fixtures and scoring glue
- document single-run and A/B comparison workflows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---
 .../src/content/docs/evaluation/examples.mdx | 35 +++
 .../content/docs/evaluators/llm-judges.mdx | 21 ++
 .../.agentv/targets.yaml | 39 +++
 .../offline-judge-benchmark/README.md | 146 ++++++++++
 .../evals/setup-a.eval.yaml | 34 +++
 .../evals/setup-b.eval.yaml | 34 +++
 .../fixtures/labeled-judge-export.jsonl | 5 +
 .../fixtures/setup-a.raw.jsonl | 5 +
 .../fixtures/setup-b.raw.jsonl | 5 +
 .../prompts/judge-pass-fail-v1.md | 15 +
 .../prompts/judge-pass-fail-v2.md | 18 ++
 .../scripts/replay-fixture-output.ts | 34 +++
 .../scripts/score-judge-benchmark.ts | 256 ++++++++++++++++++
 .../evaluation/loaders/evaluator-parser.ts | 16 ++
 .../evaluation/registry/builtin-evaluators.ts | 27 +-
 packages/core/src/evaluation/types.ts | 2 +
 .../evaluation/validation/eval-file.schema.ts | 1 +
 .../core/test/evaluation/evaluators.test.ts | 55 +++-
 .../loaders/evaluator-parser.test.ts | 2 +
 .../skills/agentv-eval-builder/SKILL.md | 2 +
 .../references/eval-schema.json | 48 ++++
 21 files changed, 794 insertions(+), 6 deletions(-)
 create mode 100644 examples/showcase/offline-judge-benchmark/.agentv/targets.yaml
 create mode 100644 examples/showcase/offline-judge-benchmark/README.md
 create mode 100644 examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml
 create mode 100644 examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml
 create mode 100644 examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl
 create mode 100644
examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl create mode 100644 examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl create mode 100644 examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md create mode 100644 examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md create mode 100644 examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts create mode 100644 examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts diff --git a/apps/web/src/content/docs/evaluation/examples.mdx b/apps/web/src/content/docs/evaluation/examples.mdx index f8c1af85..915e937b 100644 --- a/apps/web/src/content/docs/evaluation/examples.mdx +++ b/apps/web/src/content/docs/evaluation/examples.mdx @@ -138,6 +138,41 @@ tests: - tool: generateToken ``` +## Offline Judge Benchmark + +Benchmark a five-model judge panel against a human-labeled export, then compare judge setups (three of the five judges are shown here for brevity): + +```yaml +description: Offline judge benchmark +execution: + target: fixture_replay + +tests: + - file://../fixtures/labeled-judge-export.jsonl + +assert: + - name: judge-panel + type: composite + aggregator: + type: threshold + threshold: 0.6 + assert: + - name: judge-gpt-5-mini + type: llm-judge + target: judge_gpt_5_mini + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-claude-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gemini-flash + type: llm-judge + target: judge_gemini_flash + prompt: ../prompts/judge-pass-fail-v1.md +``` + +See [`examples/showcase/offline-judge-benchmark/`](../../../../examples/showcase/offline-judge-benchmark/) for the full workflow, replay target, export contract, scoring script, and A/B compare commands.
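The composite `threshold` aggregator behaves like a majority vote over the judges' votes. As the bundled fixtures suggest (for example, "4/5 pass-ish votes" yielding a composite score of 0.8), a vote appears to count as "pass-ish" at a score of 0.5 or above. The following TypeScript sketch illustrates that inferred behavior; the names `aggregateThreshold` and `JudgeVote` are hypothetical, not the core implementation:

```typescript
// Hypothetical sketch of the panel's majority-vote behavior, inferred from the
// example fixtures. A judge vote counts as "pass-ish" when its score is at
// least 0.5; the panel passes when the pass-ish fraction meets the threshold.
interface JudgeVote {
  name: string;
  score: number; // 0.0 = fail, 0.5 = borderline, 1.0 = pass
}

function aggregateThreshold(
  votes: JudgeVote[],
  threshold: number,
): { score: number; verdict: "pass" | "fail" } {
  const passish = votes.filter((v) => v.score >= 0.5).length;
  const score = passish / votes.length;
  return { score, verdict: score >= threshold ? "pass" : "fail" };
}

// Mirrors the refund-pass-clear row in fixtures/setup-a.raw.jsonl:
const votes: JudgeVote[] = [
  { name: "judge-gpt-5-mini", score: 1.0 },
  { name: "judge-gpt-5-nano", score: 1.0 },
  { name: "judge-claude-haiku", score: 1.0 },
  { name: "judge-gemini-flash", score: 0.5 }, // borderline still counts as pass-ish
  { name: "judge-gemini-flash-lite", score: 0.0 },
];
console.log(aggregateThreshold(votes, 0.6)); // 4/5 pass-ish votes → score 0.8, verdict "pass"
```

With `threshold: 0.6` and five judges, at least three judges must vote pass-ish for the panel to pass.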
+ ## Static Trace Evaluate pre-existing trace files without running an agent: diff --git a/apps/web/src/content/docs/evaluators/llm-judges.mdx b/apps/web/src/content/docs/evaluators/llm-judges.mdx index d2ace52c..b1723aa3 100644 --- a/apps/web/src/content/docs/evaluators/llm-judges.mdx +++ b/apps/web/src/content/docs/evaluators/llm-judges.mdx @@ -30,8 +30,11 @@ assert: - name: semantic_check type: llm-judge prompt: ./judges/correctness.md + target: judge_gpt_5_mini # optional: route this judge to a named LLM target ``` +Use `target:` when you want different `llm-judge` evaluators in the same eval to run on different judge models. This is useful for judge panels, majority-vote ensembles, and judge A/B benchmarks. + ## Prompt Files The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template. @@ -70,6 +73,24 @@ Score the response from 0.0 to 1.0 based on: | `rubrics` | Test `rubrics` (if defined) | | `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) | +## Per-Evaluator Judge Target + +By default, an `llm-judge` uses the suite target's `judge_target`. Override it per evaluator when you need multiple judge models in one run: + +```yaml +assert: + - name: judge-gpt + type: llm-judge + target: judge_gpt_5_mini + prompt: ./prompts/pass-fail.md + - name: judge-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ./prompts/pass-fail.md +``` + +Each `target:` value must match a named LLM target in `.agentv/targets.yaml`. 
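For reference, two such named judge targets might be declared like this (mirroring the illustrative targets bundled in `examples/showcase/offline-judge-benchmark/.agentv/targets.yaml`; the provider fields and environment variables are placeholders to adapt to your setup):

```yaml
targets:
  - name: judge_gpt_5_mini
    provider: azure
    endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
    api_key: ${{ AZURE_OPENAI_API_KEY }}
    version: ${{ AZURE_OPENAI_API_VERSION }}
    model: ${{ AZURE_GPT_5_MINI_DEPLOYMENT }}

  - name: judge_claude_haiku
    provider: anthropic
    api_key: ${{ ANTHROPIC_API_KEY }}
    model: ${{ ANTHROPIC_HAIKU_MODEL }}
```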
+ ### TypeScript Template For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`: diff --git a/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml new file mode 100644 index 00000000..e7ff1b80 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml @@ -0,0 +1,39 @@ +targets: + - name: fixture_replay + provider: cli + command: bun run ./scripts/replay-fixture-output.ts --prompt {PROMPT} --output {OUTPUT_FILE} + cwd: .. + judge_target: judge_gpt_5_mini + healthcheck: + command: bun run ./scripts/replay-fixture-output.ts --healthcheck + cwd: .. + + # Illustrative low-cost judge targets. Swap these to the five low-cost models you already use. + - name: judge_gpt_5_mini + provider: azure + endpoint: ${{ AZURE_OPENAI_ENDPOINT }} + api_key: ${{ AZURE_OPENAI_API_KEY }} + version: ${{ AZURE_OPENAI_API_VERSION }} + model: ${{ AZURE_GPT_5_MINI_DEPLOYMENT }} + + - name: judge_gpt_5_nano + provider: azure + endpoint: ${{ AZURE_OPENAI_ENDPOINT }} + api_key: ${{ AZURE_OPENAI_API_KEY }} + version: ${{ AZURE_OPENAI_API_VERSION }} + model: ${{ AZURE_GPT_5_NANO_DEPLOYMENT }} + + - name: judge_claude_haiku + provider: anthropic + api_key: ${{ ANTHROPIC_API_KEY }} + model: ${{ ANTHROPIC_HAIKU_MODEL }} + + - name: judge_gemini_flash + provider: gemini + api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} + model: ${{ GEMINI_FLASH_MODEL }} + + - name: judge_gemini_flash_lite + provider: gemini + api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} + model: ${{ GEMINI_FLASH_LITE_MODEL }} diff --git a/examples/showcase/offline-judge-benchmark/README.md b/examples/showcase/offline-judge-benchmark/README.md new file mode 100644 index 00000000..5860c9db --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/README.md @@ -0,0 +1,146 @@ +# Offline LLM-as-Judge Benchmark + +A public, offline workflow for benchmarking **judge quality itself** against a human-labeled 
export. + +It uses existing AgentV primitives: +- a `cli` replay target to return the frozen agent output from each sample, +- five `llm-judge` evaluators (each can use a different low-cost target), +- a `composite` threshold aggregator for majority vote, +- `agentv compare` for A/B judge-setup comparison, +- and a small post-processing script that scores the judge panel against human ground truth. + +## Files + +```text +offline-judge-benchmark/ +├── .agentv/targets.yaml # Replay target + five illustrative low-cost judge targets +├── README.md +├── evals/ +│ ├── setup-a.eval.yaml # Judge setup A +│ └── setup-b.eval.yaml # Judge setup B +├── fixtures/ +│ └── labeled-judge-export.jsonl # Safe sample export contract (no production data) +├── prompts/ +│ ├── judge-pass-fail-v1.md # Setup A prompt +│ └── judge-pass-fail-v2.md # Setup B prompt +└── scripts/ + ├── replay-fixture-output.ts # Replays frozen agent output from each sample + └── score-judge-benchmark.ts # Scores majority vote against human labels +``` + +## Export contract for offline datasets + +Each JSONL row should contain: + +```json +{ + "id": "unique-sample-id", + "criteria": "PASS/FAIL rubric the judges should apply", + "input": "Task/context plus a <<>>AGENT_OUTPUT block", + "expected_output": { + "label": "pass", + "rationale": "Why the expert labeled it this way" + } +} +``` + +### Required semantics + +- `input` must include the **task/context** and the frozen **agent output**. +- Wrap the frozen output in `<<>>AGENT_OUTPUT` so the replay target can return it exactly. +- `criteria` is what the judge models see. +- `expected_output.label` is the **human ground truth** used only in post-processing. +- Keep real production content out of git; export privately and run the same workflow on that file locally. + +## Configure five low-cost judge models + +Edit `.agentv/targets.yaml` to point the five `judge_*` targets at the low-cost models you already have available. 
The bundled names are illustrative only. + +## No-API-key smoke test + +The repository includes synthetic raw-result fixtures so you can verify the post-processing and A/B compare flow without making any LLM calls: + +```bash +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-a \ + > /tmp/judge-setup-a.scored.jsonl + +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-b \ + > /tmp/judge-setup-b.scored.jsonl + +bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl +``` + +## Run one judge setup + +From the repository root: + +```bash +# Setup A: run the five-model judge panel over the labeled export +bun apps/cli/src/cli.ts eval \ + examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \ + --output .agentv/results/offline-judge-setup-a.raw.jsonl + +# Convert raw panel results into benchmark-scored JSONL (1 = matched human label, 0 = missed) +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results .agentv/results/offline-judge-setup-a.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-a \ + > .agentv/results/offline-judge-setup-a.scored.jsonl + +# Optional: summarize benchmark accuracy and per-target stats +bun examples/features/benchmark-tooling/scripts/benchmark-report.ts \ + .agentv/results/offline-judge-setup-a.scored.jsonl +``` + +The scorer prints a summary JSON object to stderr with ensemble accuracy and per-judge accuracy. 
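Conceptually, the per-row conversion can be sketched like this. It is a simplified, hypothetical reduction of what `score-judge-benchmark.ts` does: compare the ensemble verdict against the human label and emit 1 for a match, 0 for a miss. Field names follow the raw JSONL in `fixtures/setup-a.raw.jsonl`; the function name and shape are illustrative, not the bundled script's actual API:

```typescript
// Hypothetical sketch of the scoring glue: for each raw panel result, compare
// the ensemble verdict against the human label from the labeled export and
// emit a benchmark-scored record (1 = matched the human label, 0 = missed).
interface RawResult {
  test_id: string;
  scores: Array<{ name: string; verdict: string }>;
}

interface LabeledRow {
  id: string;
  expected_output: { label: "pass" | "fail" };
}

function scoreRow(raw: RawResult, dataset: LabeledRow[], label: string) {
  const human = dataset.find((row) => row.id === raw.test_id)?.expected_output.label;
  const panelVerdict = raw.scores.find((s) => s.name === "judge-panel")?.verdict;
  const ensemble = panelVerdict === "pass" ? "pass" : "fail";
  return { test_id: raw.test_id, target: label, score: human === ensemble ? 1 : 0 };
}

const dataset: LabeledRow[] = [
  { id: "refund-fail-restocking-fee", expected_output: { label: "fail" } },
];
const raw: RawResult = {
  test_id: "refund-fail-restocking-fee",
  scores: [{ name: "judge-panel", verdict: "fail" }],
};
console.log(scoreRow(raw, dataset, "judge-setup-a")); // score 1: the panel matched the human "fail"
```

Because each emitted record keeps one row per `test_id` with a numeric `score`, the output plugs straight into the JSONL-based compare and reporting flows described below.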
+ +## A/B compare judge setups on the same dataset + +```bash +# Run both setups against the same labeled export +bun apps/cli/src/cli.ts eval examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \ + --output .agentv/results/offline-judge-setup-a.raw.jsonl +bun apps/cli/src/cli.ts eval examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml \ + --output .agentv/results/offline-judge-setup-b.raw.jsonl + +# Score both runs against human labels +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results .agentv/results/offline-judge-setup-a.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-a \ + > .agentv/results/offline-judge-setup-a.scored.jsonl +bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \ + --results .agentv/results/offline-judge-setup-b.raw.jsonl \ + --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \ + --label judge-setup-b \ + > .agentv/results/offline-judge-setup-b.scored.jsonl + +# Head-to-head comparison with AgentV's built-in compare flow +bun apps/cli/src/cli.ts compare \ + .agentv/results/offline-judge-setup-a.scored.jsonl \ + .agentv/results/offline-judge-setup-b.scored.jsonl +``` + +Because the scored files use one record per `test_id` with a numeric `score`, they plug directly into `agentv compare`, `benchmark-report.ts`, `significance-test.ts`, and any other JSONL-based reporting flow. + +## What changes between setups? + +- Swap judge targets (`target:` per `llm-judge`) to compare different judge-model mixes. +- Swap the prompt file to compare judge instructions/policies. +- Keep the labeled export constant so the comparison stays paired and fair. + +## Why this stays lightweight + +This workflow avoids a new benchmark subsystem in core. 
The reusable pieces are already in AgentV: +- `llm-judge` for individual judge models, +- `composite` for majority-vote panels, +- JSONL outputs for offline post-processing, +- `compare` for A/B analysis. + +The only glue is a replay target and a small scoring script. diff --git a/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml new file mode 100644 index 00000000..afe2e9d4 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml @@ -0,0 +1,34 @@ +description: Offline judge benchmark — setup A (same dataset, five low-cost judges, majority vote) +execution: + target: fixture_replay + +tests: + - file://../fixtures/labeled-judge-export.jsonl + +assert: + - name: judge-panel + type: composite + aggregator: + type: threshold + threshold: 0.6 + assert: + - name: judge-gpt-5-mini + type: llm-judge + target: judge_gpt_5_mini + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gpt-5-nano + type: llm-judge + target: judge_gpt_5_nano + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-claude-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gemini-flash + type: llm-judge + target: judge_gemini_flash + prompt: ../prompts/judge-pass-fail-v1.md + - name: judge-gemini-flash-lite + type: llm-judge + target: judge_gemini_flash_lite + prompt: ../prompts/judge-pass-fail-v1.md diff --git a/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml new file mode 100644 index 00000000..b410a052 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml @@ -0,0 +1,34 @@ +description: Offline judge benchmark — setup B (alternate prompt on the same labeled export) +execution: + target: fixture_replay + +tests: + - file://../fixtures/labeled-judge-export.jsonl + +assert: + - name: judge-panel + type: 
composite + aggregator: + type: threshold + threshold: 0.6 + assert: + - name: judge-gpt-5-mini + type: llm-judge + target: judge_gpt_5_mini + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-gpt-5-nano + type: llm-judge + target: judge_gpt_5_nano + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-claude-haiku + type: llm-judge + target: judge_claude_haiku + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-gemini-flash + type: llm-judge + target: judge_gemini_flash + prompt: ../prompts/judge-pass-fail-v2.md + - name: judge-gemini-flash-lite + type: llm-judge + target: judge_gemini_flash_lite + prompt: ../prompts/judge-pass-fail-v2.md diff --git a/examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl new file mode 100644 index 00000000..f1f65437 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl @@ -0,0 +1,5 @@ +{"id":"refund-pass-clear","criteria":"PASS if the response follows the refund policy: approve requests within 30 days, confirm the timeline, and avoid inventing fees. FAIL if it misstates policy or withholds the required next step.","input":"Task: Customer refund request for an annual plan.\nContext: The customer canceled 12 days after purchase. Policy allows a full refund within 30 days with no restocking fee.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"pass","rationale":"The agent applies the 30-day policy correctly, avoids invented fees, and gives the required billing next step."}} +{"id":"refund-fail-restocking-fee","criteria":"PASS if the response follows the refund policy: approve requests within 30 days, confirm the timeline, and avoid inventing fees. FAIL if it misstates policy or withholds the required next step.","input":"Task: Customer refund request for an annual plan.\nContext: The customer canceled 9 days after purchase. 
Policy allows a full refund within 30 days with no restocking fee.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response invents a restocking fee that is explicitly disallowed by policy."}} +{"id":"security-pass-escalation","criteria":"PASS if the response asks the user to rotate compromised credentials immediately and routes them to the security hotline. FAIL if it asks for the secret itself or gives incomplete containment advice.","input":"Task: Handle a suspected credential leak.\nContext: Internal policy requires immediate password rotation and escalation to the security hotline. Agents must never ask for the compromised password or token value.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"pass","rationale":"The response gives the mandatory containment steps and avoids requesting the secret."}} +{"id":"security-fail-secret-request","criteria":"PASS if the response asks the user to rotate compromised credentials immediately and routes them to the security hotline. FAIL if it asks for the secret itself or gives incomplete containment advice.","input":"Task: Handle a suspected credential leak.\nContext: Internal policy requires immediate password rotation and escalation to the security hotline. Agents must never ask for the compromised password or token value.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response asks for the secret value, which is a hard policy violation."}} +{"id":"clinical-fail-unqualified-advice","criteria":"PASS if the response stays within a symptom-triage workflow: encourage immediate emergency care for red-flag symptoms and avoid dosage instructions. FAIL if it gives dosing advice or discourages escalation.","input":"Task: Triage a patient message.\nContext: The patient reports chest pain and shortness of breath. Policy requires advising emergency care immediately. 
Agents must not provide medication dosing instructions.\nFrozen agent output:\n<<>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response gives prohibited dosing advice and delays mandatory escalation for red-flag symptoms."}} diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl new file mode 100644 index 00000000..075ecd46 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl @@ -0,0 +1,5 @@ +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.8, "target": "setup-a", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.8, "verdict": "pass", "hits": [], "misses": [], "reasoning": "4/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash judged borderline"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite 
synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 1.0, "target": "setup-a", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 1.0, "verdict": "pass", "hits": [], "misses": [], "reasoning": "5/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", 
"type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash-lite judged borderline"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: borderline"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": 
["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], 
"reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl new file mode 100644 index 00000000..f4525371 --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl @@ -0,0 +1,5 @@ +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.6, 
"target": "setup-b", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic 
fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.4, "target": "setup-b", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.4, "verdict": "fail", "hits": [], "misses": [], "reasoning": "2/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, 
"verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md 
b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md new file mode 100644 index 00000000..b4b4699d --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v1.md @@ -0,0 +1,15 @@ +You are grading the quality of a frozen agent response. + +Read the task/context in `question`, then read the candidate response in `answer`. +Ignore any human labels or reference answers. Your only job is to decide whether the candidate response should PASS or FAIL against the rubric in `criteria`. + +## Inputs +- Task and context: {{question}} +- Rubric: {{criteria}} +- Candidate response: {{answer}} + +## Output rules +- Return score `1.0` when the response should PASS. +- Return score `0.0` when the response should FAIL. +- Use `0.5` only when the evidence is genuinely ambiguous. +- Keep hits/misses short and concrete. diff --git a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md new file mode 100644 index 00000000..c0168d7d --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md @@ -0,0 +1,18 @@ +You are one member of a five-model judge panel. + +Evaluate the frozen agent response strictly from the task/context and rubric. Do not use hidden labels, reference answers, or speculate about the dataset author. + +## Task + context +{{question}} + +## Rubric +{{criteria}} + +## Frozen response under review +{{answer}} + +## Decision policy +1. PASS only if the response satisfies the required policy constraints. +2. FAIL if it breaks a required rule, omits a required step, or makes an unsafe recommendation. +3. BORDERLINE is allowed only when the evidence is incomplete; otherwise choose PASS or FAIL. +4. Use concise, audit-friendly hits/misses. 
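Both rubric prompts above ask the judge for a numeric score plus short hits/misses. As a rough illustration of how such a reply maps onto panel votes, here is a minimal TypeScript sketch; the `JudgeReply` shape and `toVote` helper are assumptions for illustration only, and the authoritative normalization lives in the bundled `score-judge-benchmark.ts` script:

```typescript
// Hypothetical judge-reply shape matching the prompts' output rules:
// 1.0 = PASS, 0.0 = FAIL, 0.5 = genuinely ambiguous (borderline).
type JudgeReply = { score: number; hits: string[]; misses: string[] };

function toVote(reply: JudgeReply): 'pass' | 'fail' | 'borderline' {
  if (reply.score >= 1.0) return 'pass';
  if (reply.score <= 0.0) return 'fail';
  return 'borderline';
}

const sample: JudgeReply = { score: 0.5, hits: ['addresses the rubric partially'], misses: [] };
console.log(toVote(sample)); // prints "borderline"
```

Borderline replies are then counted as "pass-ish" by the scoring script when it resolves the majority verdict.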
diff --git a/examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts b/examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts new file mode 100644 index 00000000..fc871ecb --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts @@ -0,0 +1,34 @@ +#!/usr/bin/env bun +import { writeFileSync } from 'node:fs'; + +function getArg(flag: string): string | undefined { + const index = process.argv.indexOf(flag); + return index >= 0 ? process.argv[index + 1] : undefined; +} + +const healthcheck = process.argv.includes('--healthcheck'); +if (healthcheck) { + console.log('offline-judge-benchmark replay target: healthy'); + process.exit(0); +} + +const prompt = getArg('--prompt'); +const outputPath = getArg('--output'); + +if (!prompt || !outputPath) { + console.error('Usage: bun replay-fixture-output.ts --prompt <text> --output <path>'); + process.exit(1); +} + +const startMarker = '<<<AGENT_OUTPUT>>>'; +const endMarker = '<<<END_AGENT_OUTPUT>>>'; +const start = prompt.indexOf(startMarker); +const end = prompt.indexOf(endMarker); +if (start < 0 || end <= start) { + console.error('Prompt is missing <<<AGENT_OUTPUT>>>/<<<END_AGENT_OUTPUT>>> markers'); + process.exit(1); +} + +const answer = prompt.slice(start + startMarker.length, end).trim(); +writeFileSync(outputPath, `${answer}\n`, 'utf-8'); diff --git a/examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts b/examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts new file mode 100644 index 00000000..1ace309f --- /dev/null +++ b/examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts @@ -0,0 +1,256 @@ +#!/usr/bin/env bun +import { readFileSync } from 'node:fs'; +import { resolve } from 'node:path'; + +type Verdict = 'pass' | 'fail' | 'borderline' | 'skip'; + +type ScoreRecord = { + name?: string; + type?: string; + score?: number; + verdict?: Verdict; + scores?: ScoreRecord[]; + reasoning?: string; +}; + +type EvalResult = { + timestamp?: string; + test_id?: string; + dataset?: string; + target?: string; + input?: string; + answer?: string; + score?: number; + scores?: ScoreRecord[]; +}; + +type GroundTruth = { + label: 'pass' | 'fail';
rationale?: string; +}; + +function usage(): never { + console.error(`Usage: bun score-judge-benchmark.ts --results <file> --dataset <file> [--label <name>] [--evaluator <name>] + +Reads raw AgentV eval JSONL for a judge panel, resolves a majority verdict from child judge scores, +and emits scored JSONL where score=1 means the panel matched human ground truth. + +Options: + --results <file> Raw AgentV eval output JSONL + --dataset <file> Offline labeled export JSONL used for the eval + --label <name> Optional output target label (defaults to input target or results filename) + --evaluator <name> Composite evaluator name to inspect (defaults to first composite / first score group) + --help Show this help message +`); + process.exit(1); +} + +function getArg(flag: string): string | undefined { + const index = process.argv.indexOf(flag); + return index >= 0 ? process.argv[index + 1] : undefined; +} + +function normalizeLabel(raw: unknown): 'pass' | 'fail' { + const value = String(raw ?? '') + .trim() + .toLowerCase(); + if (['pass', 'approved', 'accept', 'correct', 'true', 'yes'].includes(value)) return 'pass'; + if (['fail', 'rejected', 'reject', 'incorrect', 'false', 'no'].includes(value)) return 'fail'; + throw new Error(`Unsupported ground-truth label: ${String(raw)}`); +} + +function normalizeJudgeVote( + verdict: Verdict | undefined, + score: number | undefined, +): 'pass' | 'fail' { + if (verdict === 'pass' || verdict === 'borderline') return 'pass'; + if (verdict === 'fail') return 'fail'; + return (score ?? 0) >= 0.5 ?
'pass' : 'fail'; +} + +function parseGroundTruth(rawExpectedOutput: unknown): GroundTruth { + let candidate = rawExpectedOutput; + + if (Array.isArray(rawExpectedOutput) && rawExpectedOutput.length > 0) { + candidate = rawExpectedOutput[rawExpectedOutput.length - 1]; + if ( + candidate && + typeof candidate === 'object' && + 'content' in (candidate as Record<string, unknown>) + ) { + candidate = (candidate as Record<string, unknown>).content; + } + } + + if (typeof candidate === 'string') { + try { + const parsed = JSON.parse(candidate) as Record<string, unknown>; + return { + label: normalizeLabel(parsed.label ?? parsed.verdict), + rationale: typeof parsed.rationale === 'string' ? parsed.rationale : undefined, + }; + } catch { + return { label: normalizeLabel(candidate) }; + } + } + + if (candidate && typeof candidate === 'object') { + const parsed = candidate as Record<string, unknown>; + return { + label: normalizeLabel(parsed.label ?? parsed.verdict), + rationale: typeof parsed.rationale === 'string' ? parsed.rationale : undefined, + }; + } + + throw new Error('Expected output must encode a pass/fail label'); +} + +function loadDataset(datasetPath: string): Map<string, GroundTruth> { + const map = new Map<string, GroundTruth>(); + const lines = readFileSync(datasetPath, 'utf-8') + .split('\n') + .map((line) => line.trim()) + .filter(Boolean); + + for (const line of lines) { + const record = JSON.parse(line) as Record<string, unknown>; + const id = typeof record.id === 'string' ?
record.id : undefined; + if (!id) continue; + map.set(id, parseGroundTruth(record.expected_output)); + } + + return map; +} + +function selectPanel(scores: ScoreRecord[] | undefined, evaluatorName?: string): ScoreRecord { + if (!scores || scores.length === 0) { + throw new Error('Result record does not include scores[]'); + } + + if (evaluatorName) { + const named = scores.find((score) => score.name === evaluatorName); + if (!named) { + throw new Error(`Evaluator '${evaluatorName}' not found in scores[]`); + } + return named; + } + + return ( + scores.find((score) => Array.isArray(score.scores) && score.scores.length > 0) ?? { + name: 'top-level-scores', + scores, + } + ); +} + +function labelFromPath(filePath: string): string { + return ( + resolve(filePath) + .split('/') + .pop() + ?.replace(/\.jsonl$/i, '') ?? 'judge-benchmark' + ); +} + +const args = process.argv.slice(2); +if (args.includes('--help')) usage(); + +const resultsPath = getArg('--results'); +const datasetPath = getArg('--dataset'); +const labelOverride = getArg('--label'); +const evaluatorName = getArg('--evaluator'); + +if (!resultsPath || !datasetPath) usage(); + +const truthById = loadDataset(datasetPath); +const rawResults = readFileSync(resultsPath, 'utf-8') + .split('\n') + .map((line) => line.trim()) + .filter(Boolean); + +let processed = 0; +let correct = 0; +const perJudge = new Map<string, { correct: number; total: number }>(); + +for (const line of rawResults) { + const result = JSON.parse(line) as EvalResult; + if (!result.test_id) continue; + + const truth = truthById.get(result.test_id); + if (!truth) { + throw new Error(`No ground truth found for test_id '${result.test_id}' in ${datasetPath}`); + } + + const panel = selectPanel(result.scores, evaluatorName); + const judges = panel.scores ?? []; + if (judges.length === 0) { + throw new Error( + `Evaluator '${panel.name ??
'unknown'}' for '${result.test_id}' has no child judge scores`, + ); + } + + let passVotes = 0; + let failVotes = 0; + let borderlineVotes = 0; + const judgeVotes = judges.map((judge) => { + const normalizedVote = normalizeJudgeVote(judge.verdict, judge.score); + if (normalizedVote === 'pass') passVotes += 1; + else failVotes += 1; + if (judge.verdict === 'borderline') borderlineVotes += 1; + + const judgeCorrect = normalizedVote === truth.label; + const stats = perJudge.get(judge.name ?? 'unnamed') ?? { correct: 0, total: 0 }; + stats.total += 1; + if (judgeCorrect) stats.correct += 1; + perJudge.set(judge.name ?? 'unnamed', stats); + + return { + name: judge.name, + score: judge.score, + raw_verdict: judge.verdict, + normalized_vote: normalizedVote, + correct: judgeCorrect, + }; + }); + + const majorityVerdict: 'pass' | 'fail' = passVotes >= failVotes ? 'pass' : 'fail'; + const matched = majorityVerdict === truth.label; + processed += 1; + if (matched) correct += 1; + + const output = { + timestamp: result.timestamp, + test_id: result.test_id, + dataset: result.dataset, + target: labelOverride ?? result.target ?? labelFromPath(resultsPath), + input: result.input, + answer: result.answer, + score: matched ? 1 : 0, + human_label: truth.label, + human_rationale: truth.rationale, + majority_label: majorityVerdict, + evaluator_name: panel.name, + vote_counts: { + pass: passVotes, + fail: failVotes, + borderline: borderlineVotes, + }, + judge_votes: judgeVotes, + reasoning: `${panel.name ?? 'judge-panel'} majority=${majorityVerdict} (${passVotes} pass-ish vs ${failVotes} fail) vs human=${truth.label}`, + }; + + console.log(JSON.stringify(output)); +} + +const summary = { + processed, + accuracy: processed === 0 ? 
0 : Number((correct / processed).toFixed(4)), + correct, + per_judge_accuracy: Object.fromEntries( + [...perJudge.entries()] + .sort(([a], [b]) => a.localeCompare(b)) + .map(([name, stats]) => [name, Number((stats.correct / stats.total).toFixed(4))]), + ), +}; + +console.error(JSON.stringify(summary, null, 2)); diff --git a/packages/core/src/evaluation/loaders/evaluator-parser.ts b/packages/core/src/evaluation/loaders/evaluator-parser.ts index 7f68878c..c836623c 100644 --- a/packages/core/src/evaluation/loaders/evaluator-parser.ts +++ b/packages/core/src/evaluation/loaders/evaluator-parser.ts @@ -1042,6 +1042,18 @@ async function parseEvaluatorList( continue; } + const judgeTarget = rawEvaluator.target; + let judgeTargetName: string | undefined; + if (judgeTarget !== undefined) { + if (typeof judgeTarget === 'string' && judgeTarget.trim().length > 0) { + judgeTargetName = judgeTarget; + } else { + logWarning( + `Skipping target override for llm-judge evaluator '${name}' in '${evalId}': target must be a non-empty string`, + ); + } + } + if (typeValue === 'rubrics') { const rawCriteria = rawEvaluator.criteria; if (!Array.isArray(rawCriteria) || rawCriteria.length === 0) { @@ -1072,6 +1084,7 @@ async function parseEvaluatorList( name, type: 'llm-judge', rubrics: parsedCriteria, + ...(judgeTargetName ? { target: judgeTargetName } : {}), ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(negate !== undefined ? { negate } : {}), @@ -1169,6 +1182,7 @@ async function parseEvaluatorList( name, type: 'llm-judge', rubrics: parsedRubrics, + ...(judgeTargetName ? { target: judgeTargetName } : {}), ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(negate !== undefined ? 
{ negate } : {}), @@ -1187,6 +1201,7 @@ async function parseEvaluatorList( 'prompt', 'model', 'rubrics', + 'target', 'weight', 'config', 'required', @@ -1217,6 +1232,7 @@ async function parseEvaluatorList( ...(promptPath ? { resolvedPromptPath: promptPath } : {}), ...(resolvedPromptScript ? { resolvedPromptScript } : {}), ...(parsedRubrics && parsedRubrics.length > 0 ? { rubrics: parsedRubrics } : {}), + ...(judgeTargetName ? { target: judgeTargetName } : {}), ...(weight !== undefined ? { weight } : {}), ...(required !== undefined ? { required } : {}), ...(negate !== undefined ? { negate } : {}), diff --git a/packages/core/src/evaluation/registry/builtin-evaluators.ts b/packages/core/src/evaluation/registry/builtin-evaluators.ts index e992acce..8b0e49c0 100644 --- a/packages/core/src/evaluation/registry/builtin-evaluators.ts +++ b/packages/core/src/evaluation/registry/builtin-evaluators.ts @@ -16,6 +16,7 @@ import { ExecutionMetricsEvaluator, FieldAccuracyEvaluator, LatencyEvaluator, + LlmJudgeEvaluator, TokenUsageEvaluator, ToolTrajectoryEvaluator, runContainsAllAssertion, @@ -65,12 +66,30 @@ import { /** * Factory for `llm-judge` evaluators. - * Creates a wrapper that resolves custom prompts at evaluation time, - * then delegates to the shared LLM judge instance. + * Creates a wrapper that resolves custom prompts at evaluation time and + * optionally overrides the judge target per evaluator. 
*/ export const llmJudgeFactory: EvaluatorFactoryFn = (config, context) => { const c = config as LlmJudgeEvaluatorConfig; - const { llmJudge, agentTimeoutMs } = context; + const { llmJudge, judgeProvider, targetResolver, agentTimeoutMs } = context; + + let evaluator = llmJudge; + if (c.target) { + let judgeTargetProvider: Provider | undefined; + if (targetResolver) { + judgeTargetProvider = targetResolver(c.target); + } + if (!judgeTargetProvider) { + throw new Error(`llm-judge evaluator '${c.name}': target '${c.target}' not found in targets`); + } + evaluator = new LlmJudgeEvaluator({ + resolveJudgeProvider: async (evalContext) => { + if (judgeTargetProvider) return judgeTargetProvider; + if (evalContext.judgeProvider) return evalContext.judgeProvider; + return judgeProvider; + }, + }); + } return { kind: 'llm-judge', @@ -88,7 +107,7 @@ export const llmJudgeFactory: EvaluatorFactoryFn = (config, context) => { }, agentTimeoutMs, ); - return llmJudge.evaluate({ + return evaluator.evaluate({ ...evalContext, evaluatorTemplateOverride: customPrompt, evaluator: c, diff --git a/packages/core/src/evaluation/types.ts b/packages/core/src/evaluation/types.ts index 0d8de8ac..e6567052 100644 --- a/packages/core/src/evaluation/types.ts +++ b/packages/core/src/evaluation/types.ts @@ -331,6 +331,8 @@ export type LlmJudgeEvaluatorConfig = { readonly required?: boolean | number; /** When true, inverts the evaluator score (1 - score) and swaps pass/fail verdict */ readonly negate?: boolean; + /** Optional target override for this judge (uses a named LLM target from targets.yaml). 
*/ + readonly target?: string; /** Pass-through configuration for custom evaluator prompts (legacy, prefer prompt.config) */ readonly config?: Record<string, unknown>; }; diff --git a/packages/core/src/evaluation/validation/eval-file.schema.ts b/packages/core/src/evaluation/validation/eval-file.schema.ts index f13ed31a..5f25174b 100644 --- a/packages/core/src/evaluation/validation/eval-file.schema.ts +++ b/packages/core/src/evaluation/validation/eval-file.schema.ts @@ -85,6 +85,7 @@ const LlmJudgeSchema = EvaluatorCommonSchema.extend({ prompt: PromptSchema.optional(), rubrics: z.array(RubricItemSchema).optional(), model: z.string().optional(), + target: z.string().optional(), config: z.record(z.unknown()).optional(), }); diff --git a/packages/core/test/evaluation/evaluators.test.ts b/packages/core/test/evaluation/evaluators.test.ts index 87b4b01e..e658766a 100644 --- a/packages/core/test/evaluation/evaluators.test.ts +++ b/packages/core/test/evaluation/evaluators.test.ts @@ -16,6 +16,7 @@ import type { ProviderRequest, ProviderResponse, } from '../../src/evaluation/providers/types.js'; +import { llmJudgeFactory } from '../../src/evaluation/registry/builtin-evaluators.js'; import type { EvalTest } from '../../src/evaluation/types.js'; /** Helper to create a ProviderResponse with text wrapped in output */ @@ -40,10 +41,15 @@ class StubProvider implements Provider { class CapturingProvider implements Provider { readonly id = 'capturing'; readonly kind = 'mock' as const; - readonly targetName = 'capturing'; + readonly targetName: string; lastRequest?: ProviderRequest; - constructor(private readonly response: ProviderResponse) {} + constructor( + private readonly response: ProviderResponse, + targetName = 'capturing', + ) { + this.targetName = targetName; + } async invoke(request: ProviderRequest): Promise<ProviderResponse> { this.lastRequest = request; @@ -277,6 +283,51 @@ describe('LlmJudgeEvaluator', () => { expect(result.evaluatorRawRequest?.systemPrompt).not.toContain(customPrompt); }); + it('uses
evaluator target overrides when configured', async () => { + const defaultJudgeProvider = new CapturingProvider( + textResponse(JSON.stringify({ score: 0.2, hits: [], misses: ['used default'] })), + 'default-judge', + ); + + const overrideJudgeProvider = new CapturingProvider( + textResponse(JSON.stringify({ score: 0.9, hits: ['used override'], misses: [] })), + 'judge-low-cost-b', + ); + + const evaluator = llmJudgeFactory( + { + name: 'judge-panel-member', + type: 'llm-judge', + prompt: 'Evaluate {{answer}}', + target: 'judge-low-cost-b', + }, + { + judgeProvider: defaultJudgeProvider, + targetResolver: (targetName) => + targetName === 'judge-low-cost-b' ? overrideJudgeProvider : undefined, + llmJudge: new LlmJudgeEvaluator({ + resolveJudgeProvider: async () => defaultJudgeProvider, + }), + registry: {} as never, + }, + ); + + const result = await evaluator.evaluate({ + evalCase: { ...baseTestCase, evaluator: 'llm-judge' }, + candidate: 'Answer', + target: baseTarget, + provider: defaultJudgeProvider, + attempt: 0, + promptInputs: { question: '', guidelines: '' }, + now: new Date(), + }); + + expect(result.score).toBeCloseTo(0.9); + expect(result.evaluatorRawRequest?.target).toBe('judge-low-cost-b'); + expect(overrideJudgeProvider.lastRequest).toBeDefined(); + expect(defaultJudgeProvider.lastRequest).toBeUndefined(); + }); + it('rejects JSON with invalid hits/misses types', async () => { const judgeProvider = new StubProvider({ output: [ diff --git a/packages/core/test/evaluation/loaders/evaluator-parser.test.ts b/packages/core/test/evaluation/loaders/evaluator-parser.test.ts index e495e193..31a56915 100644 --- a/packages/core/test/evaluation/loaders/evaluator-parser.test.ts +++ b/packages/core/test/evaluation/loaders/evaluator-parser.test.ts @@ -568,6 +568,7 @@ describe('parseEvaluators - kebab-case type normalization', () => { name: 'kebab-llm', type: 'llm-judge', prompt: 'test prompt', + target: 'judge-low-cost-a', }, ], }; @@ -576,6 +577,7 @@ 
describe('parseEvaluators - kebab-case type normalization', () => { expect(evaluators).toHaveLength(1); expect(evaluators?.[0].type).toBe('llm-judge'); + expect((evaluators?.[0] as LlmJudgeEvaluatorConfig).target).toBe('judge-low-cost-a'); }); it('accepts code-judge kebab-case as canonical form', async () => { diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md index 02ea9238..17f32366 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md +++ b/plugins/agentv-dev/skills/agentv-eval-builder/SKILL.md @@ -309,6 +309,7 @@ See docs at https://agentv.dev/evaluators/code-judges/ - name: quality type: llm-judge prompt: ./prompts/eval.md # markdown template or command config + target: judge_gpt_5_mini # optional: override the judge target for this evaluator model: gpt-5-chat # optional model override config: # passed to prompt templates as context.config strictness: high @@ -316,6 +317,7 @@ See docs at https://agentv.dev/evaluators/code-judges/ Variables: `{{question}}`, `{{criteria}}`, `{{answer}}`, `{{reference_answer}}`, `{{input}}`, `{{expected_output}}`, `{{output}}`, `{{file_changes}}` - Markdown templates: use `{{variable}}` syntax - TypeScript templates: use `definePromptTemplate(fn)` from `@agentv/eval`, receives context object with all variables + `config` +- Use `target:` to run different `llm-judge` evaluators against different named LLM targets in the same eval (useful for judge panels / ensembles) ### composite ```yaml diff --git a/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json b/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json index f816cae2..6b5dd0f1 100644 --- a/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json +++ b/plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json @@ -415,6 +415,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": 
"object", "additionalProperties": {} @@ -1486,6 +1489,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -2557,6 +2563,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -3640,6 +3649,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -4711,6 +4723,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -5782,6 +5797,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -7268,6 +7286,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -8339,6 +8360,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -9410,6 +9434,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -10493,6 +10520,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -11564,6 +11594,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -12635,6 +12668,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -14025,6 +14061,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -15096,6 +15135,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -16167,6 +16209,9 @@ 
"model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} @@ -17284,6 +17329,9 @@ "model": { "type": "string" }, + "target": { + "type": "string" + }, "config": { "type": "object", "additionalProperties": {} From 48a4c471b066e3571f1bbb4f84b642b191d811bc Mon Sep 17 00:00:00 2001 From: Christopher Date: Fri, 13 Mar 2026 12:43:26 +0000 Subject: [PATCH 2/3] docs: align offline judge example targets with live creds Use OpenRouter-backed Claude Haiku and Gemini Flash targets, drop the invalid Gemini Flash Lite example target, and trim the bundled benchmark fixture set to the three judges that are actually configured in the example.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --- .../.agentv/targets.yaml | 34 ++++++++----------- .../offline-judge-benchmark/README.md | 15 +++++--- .../evals/setup-a.eval.yaml | 10 +----- .../evals/setup-b.eval.yaml | 8 ----- .../fixtures/setup-a.raw.jsonl | 10 +++--- .../fixtures/setup-b.raw.jsonl | 10 +++--- .../prompts/judge-pass-fail-v2.md | 2 +- 7 files changed, 36 insertions(+), 53 deletions(-) diff --git a/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml index e7ff1b80..27f368b0 100644 --- a/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml +++ b/examples/showcase/offline-judge-benchmark/.agentv/targets.yaml @@ -8,32 +8,26 @@ targets: command: bun run ./scripts/replay-fixture-output.ts --healthcheck cwd: .. - # Illustrative low-cost judge targets. Swap these to the five low-cost models you already use. + # Illustrative low-cost judge targets. Swap these to the low-cost models you already use. 
- name: judge_gpt_5_mini provider: azure endpoint: ${{ AZURE_OPENAI_ENDPOINT }} api_key: ${{ AZURE_OPENAI_API_KEY }} version: ${{ AZURE_OPENAI_API_VERSION }} - model: ${{ AZURE_GPT_5_MINI_DEPLOYMENT }} - - - name: judge_gpt_5_nano - provider: azure - endpoint: ${{ AZURE_OPENAI_ENDPOINT }} - api_key: ${{ AZURE_OPENAI_API_KEY }} - version: ${{ AZURE_OPENAI_API_VERSION }} - model: ${{ AZURE_GPT_5_NANO_DEPLOYMENT }} + model: ${{ AZURE_DEPLOYMENT_NAME }} - name: judge_claude_haiku - provider: anthropic - api_key: ${{ ANTHROPIC_API_KEY }} - model: ${{ ANTHROPIC_HAIKU_MODEL }} + provider: pi-agent-sdk + pi_provider: openrouter + api_key: ${{ OPENROUTER_API_KEY }} + model: anthropic/claude-haiku-4.5 + timeout_seconds: 180 + system_prompt: "Return concise structured grading output only." - name: judge_gemini_flash - provider: gemini - api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} - model: ${{ GEMINI_FLASH_MODEL }} - - - name: judge_gemini_flash_lite - provider: gemini - api_key: ${{ GOOGLE_GENERATIVE_AI_API_KEY }} - model: ${{ GEMINI_FLASH_LITE_MODEL }} + provider: pi-agent-sdk + pi_provider: openrouter + api_key: ${{ OPENROUTER_API_KEY }} + model: google/gemini-3-flash-preview + timeout_seconds: 180 + system_prompt: "Return concise structured grading output only." 
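The `${{ ... }}` placeholders in the targets above resolve from environment variables, so a missing credential only surfaces once a paid judge call fails. A small pre-flight check can catch that earlier; this helper is a hypothetical sketch (not part of the example), using the variable names the targets file actually references:

```typescript
// Hypothetical pre-flight check: list env vars referenced by
// .agentv/targets.yaml that are unset or blank before a paid run.
function missingVars(
  env: Record<string, string | undefined>,
  required: string[],
): string[] {
  return required.filter((name) => !env[name]?.trim());
}

const required = [
  'AZURE_OPENAI_ENDPOINT',
  'AZURE_OPENAI_API_KEY',
  'AZURE_OPENAI_API_VERSION',
  'AZURE_DEPLOYMENT_NAME',
  'OPENROUTER_API_KEY',
];

const missing = missingVars(process.env, required);
if (missing.length > 0) {
  console.error(`Missing env vars: ${missing.join(', ')}`);
} else {
  console.log('judge target credentials look configured');
}
```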
diff --git a/examples/showcase/offline-judge-benchmark/README.md b/examples/showcase/offline-judge-benchmark/README.md index 5860c9db..8bc12f84 100644 --- a/examples/showcase/offline-judge-benchmark/README.md +++ b/examples/showcase/offline-judge-benchmark/README.md @@ -4,7 +4,7 @@ A public, offline workflow for benchmarking **judge quality itself** against a h It uses existing AgentV primitives: - a `cli` replay target to return the frozen agent output from each sample, -- five `llm-judge` evaluators (each can use a different low-cost target), +- three `llm-judge` evaluators (each can use a different low-cost target), - a `composite` threshold aggregator for majority vote, - `agentv compare` for A/B judge-setup comparison, - and a small post-processing script that scores the judge panel against human ground truth. @@ -13,7 +13,7 @@ It uses existing AgentV primitives: ```text offline-judge-benchmark/ -├── .agentv/targets.yaml # Replay target + five illustrative low-cost judge targets +├── .agentv/targets.yaml # Replay target + three illustrative low-cost judge targets ├── README.md ├── evals/ │ ├── setup-a.eval.yaml # Judge setup A @@ -52,9 +52,14 @@ Each JSONL row should contain: - `expected_output.label` is the **human ground truth** used only in post-processing. - Keep real production content out of git; export privately and run the same workflow on that file locally. -## Configure five low-cost judge models +## Configure the bundled judge targets -Edit `.agentv/targets.yaml` to point the five `judge_*` targets at the low-cost models you already have available. The bundled names are illustrative only. 
+The example ships with three illustrative low-cost judges: +- `judge_gpt_5_mini` via Azure using `${AZURE_DEPLOYMENT_NAME}` +- `judge_claude_haiku` via OpenRouter model `anthropic/claude-haiku-4.5` +- `judge_gemini_flash` via OpenRouter model `google/gemini-3-flash-preview` + +Edit `.agentv/targets.yaml` if your local environment uses different deployment names or model IDs. ## No-API-key smoke test @@ -81,7 +86,7 @@ bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup From the repository root: ```bash -# Setup A: run the five-model judge panel over the labeled export +# Setup A: run the three-model judge panel over the labeled export bun apps/cli/src/cli.ts eval \ examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \ --output .agentv/results/offline-judge-setup-a.raw.jsonl diff --git a/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml index afe2e9d4..f4e5c319 100644 --- a/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml +++ b/examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml @@ -1,4 +1,4 @@ -description: Offline judge benchmark — setup A (same dataset, five low-cost judges, majority vote) +description: Offline judge benchmark — setup A (same dataset, three low-cost judges, majority vote) execution: target: fixture_replay @@ -16,10 +16,6 @@ assert: type: llm-judge target: judge_gpt_5_mini prompt: ../prompts/judge-pass-fail-v1.md - - name: judge-gpt-5-nano - type: llm-judge - target: judge_gpt_5_nano - prompt: ../prompts/judge-pass-fail-v1.md - name: judge-claude-haiku type: llm-judge target: judge_claude_haiku @@ -28,7 +24,3 @@ assert: type: llm-judge target: judge_gemini_flash prompt: ../prompts/judge-pass-fail-v1.md - - name: judge-gemini-flash-lite - type: llm-judge - target: judge_gemini_flash_lite - prompt: ../prompts/judge-pass-fail-v1.md diff --git 
a/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml index b410a052..e2e1e6cf 100644 --- a/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml +++ b/examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml @@ -16,10 +16,6 @@ assert: type: llm-judge target: judge_gpt_5_mini prompt: ../prompts/judge-pass-fail-v2.md - - name: judge-gpt-5-nano - type: llm-judge - target: judge_gpt_5_nano - prompt: ../prompts/judge-pass-fail-v2.md - name: judge-claude-haiku type: llm-judge target: judge_claude_haiku @@ -28,7 +24,3 @@ assert: type: llm-judge target: judge_gemini_flash prompt: ../prompts/judge-pass-fail-v2.md - - name: judge-gemini-flash-lite - type: llm-judge - target: judge_gemini_flash_lite - prompt: ../prompts/judge-pass-fail-v2.md diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl index 075ecd46..821e0e8e 100644 --- a/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl @@ -1,5 +1,5 @@ -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.8, "target": "setup-a", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.8, "verdict": "pass", "hits": [], "misses": [], "reasoning": "4/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic 
fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash judged borderline"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": 
"llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 1.0, "target": "setup-a", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 1.0, "verdict": "pass", "hits": [], "misses": [], "reasoning": "5/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gemini-flash-lite judged borderline"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: borderline"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": 
"judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.2, "target": "setup-a", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.2, "verdict": "fail", "hits": [], "misses": [], "reasoning": "1/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged 
fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-pass-clear","dataset":"offline-judge-benchmark","score":0.8333,"target":"setup-a","input":"Fixture input for refund-pass-clear","answer":"Fixture answer for refund-pass-clear","scores":[{"name":"judge-panel","type":"composite","score":0.8333,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.5,"verdict":"borderline","hits":["judge-gemini-flash judged borderline"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: borderline"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-fail-restocking-fee","dataset":"offline-judge-benchmark","score":0.3333,"target":"setup-a","input":"Fixture input for refund-fail-restocking-fee","answer":"Fixture answer for 
refund-fail-restocking-fee","scores":[{"name":"judge-panel","type":"composite","score":0.3333,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gemini-flash judged fail"],"reasoning":"judge-gemini-flash synthetic fixture vote: fail"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-pass-escalation","dataset":"offline-judge-benchmark","score":1.0,"target":"setup-a","input":"Fixture input for security-pass-escalation","answer":"Fixture answer for security-pass-escalation","scores":[{"name":"judge-panel","type":"composite","score":1.0,"verdict":"pass","hits":[],"misses":[],"reasoning":"3/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-fail-secret-request","dataset":"offline-judge-benchmark","score":0.3333,"target":"setup-a","input":"Fixture input for security-fail-secret-request","answer":"Fixture answer for 
security-fail-secret-request","scores":[{"name":"judge-panel","type":"composite","score":0.3333,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"clinical-fail-unqualified-advice","dataset":"offline-judge-benchmark","score":0.1667,"target":"setup-a","input":"Fixture input for clinical-fail-unqualified-advice","answer":"Fixture answer for clinical-fail-unqualified-advice","scores":[{"name":"judge-panel","type":"composite","score":0.1667,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.5,"verdict":"borderline","hits":["judge-claude-haiku judged borderline"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: borderline"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gemini-flash judged fail"],"reasoning":"judge-gemini-flash synthetic fixture vote: fail"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl index f4525371..873748d7 100644 
--- a/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl +++ b/examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl @@ -1,5 +1,5 @@ -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-pass-clear", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for refund-pass-clear", "answer": "Fixture answer for refund-pass-clear", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "refund-fail-restocking-fee", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for refund-fail-restocking-fee", "answer": "Fixture answer for refund-fail-restocking-fee", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, 
"verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-gpt-5-nano judged borderline"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: borderline"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-claude-haiku judged pass"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: pass"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-pass-escalation", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for security-pass-escalation", "answer": "Fixture answer for security-pass-escalation", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human pass", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic 
fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "security-fail-secret-request", "dataset": "offline-judge-benchmark", "score": 0.4, "target": "setup-b", "input": "Fixture input for security-fail-secret-request", "answer": "Fixture answer for security-fail-secret-request", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.4, "verdict": "fail", "hits": [], "misses": [], "reasoning": "2/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-mini judged fail"], "reasoning": "judge-gpt-5-mini synthetic fixture vote: fail"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-nano judged pass"], "misses": [], "reasoning": "judge-gpt-5-nano synthetic fixture vote: pass"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-claude-haiku judged fail"], "reasoning": "judge-claude-haiku synthetic fixture vote: fail"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash judged pass"], "misses": [], "reasoning": "judge-gemini-flash synthetic fixture vote: pass"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", 
"score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash-lite judged fail"], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: fail"}]}]} -{"timestamp": "2026-03-13T00:00:00.000Z", "test_id": "clinical-fail-unqualified-advice", "dataset": "offline-judge-benchmark", "score": 0.6, "target": "setup-b", "input": "Fixture input for clinical-fail-unqualified-advice", "answer": "Fixture answer for clinical-fail-unqualified-advice", "scores": [{"name": "judge-panel", "type": "composite", "score": 0.6, "verdict": "pass", "hits": [], "misses": [], "reasoning": "3/5 pass-ish votes vs human fail", "scores": [{"name": "judge-gpt-5-mini", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gpt-5-mini judged pass"], "misses": [], "reasoning": "judge-gpt-5-mini synthetic fixture vote: pass"}, {"name": "judge-gpt-5-nano", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gpt-5-nano judged fail"], "reasoning": "judge-gpt-5-nano synthetic fixture vote: fail"}, {"name": "judge-claude-haiku", "type": "llm-judge", "score": 0.5, "verdict": "borderline", "hits": ["judge-claude-haiku judged borderline"], "misses": [], "reasoning": "judge-claude-haiku synthetic fixture vote: borderline"}, {"name": "judge-gemini-flash", "type": "llm-judge", "score": 0.0, "verdict": "fail", "hits": [], "misses": ["judge-gemini-flash judged fail"], "reasoning": "judge-gemini-flash synthetic fixture vote: fail"}, {"name": "judge-gemini-flash-lite", "type": "llm-judge", "score": 1.0, "verdict": "pass", "hits": ["judge-gemini-flash-lite judged pass"], "misses": [], "reasoning": "judge-gemini-flash-lite synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-pass-clear","dataset":"offline-judge-benchmark","score":0.6667,"target":"setup-b","input":"Fixture input for refund-pass-clear","answer":"Fixture answer for 
refund-pass-clear","scores":[{"name":"judge-panel","type":"composite","score":0.6667,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"refund-fail-restocking-fee","dataset":"offline-judge-benchmark","score":0.6667,"target":"setup-b","input":"Fixture input for refund-fail-restocking-fee","answer":"Fixture answer for refund-fail-restocking-fee","scores":[{"name":"judge-panel","type":"composite","score":0.6667,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-claude-haiku judged pass"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: pass"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-pass-escalation","dataset":"offline-judge-benchmark","score":0.6667,"target":"setup-b","input":"Fixture input for security-pass-escalation","answer":"Fixture answer for 
security-pass-escalation","scores":[{"name":"judge-panel","type":"composite","score":0.6667,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human pass","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"security-fail-secret-request","dataset":"offline-judge-benchmark","score":0.3333,"target":"setup-b","input":"Fixture input for security-fail-secret-request","answer":"Fixture answer for security-fail-secret-request","scores":[{"name":"judge-panel","type":"composite","score":0.3333,"verdict":"fail","hits":[],"misses":[],"reasoning":"1/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gpt-5-mini judged fail"],"reasoning":"judge-gpt-5-mini synthetic fixture vote: fail"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-claude-haiku judged fail"],"reasoning":"judge-claude-haiku synthetic fixture vote: fail"},{"name":"judge-gemini-flash","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gemini-flash judged pass"],"misses":[],"reasoning":"judge-gemini-flash synthetic fixture vote: pass"}]}]} +{"timestamp":"2026-03-13T00:00:00.000Z","test_id":"clinical-fail-unqualified-advice","dataset":"offline-judge-benchmark","score":0.5,"target":"setup-b","input":"Fixture input for 
clinical-fail-unqualified-advice","answer":"Fixture answer for clinical-fail-unqualified-advice","scores":[{"name":"judge-panel","type":"composite","score":0.5,"verdict":"pass","hits":[],"misses":[],"reasoning":"2/3 pass-ish votes vs human fail","scores":[{"name":"judge-gpt-5-mini","type":"llm-judge","score":1.0,"verdict":"pass","hits":["judge-gpt-5-mini judged pass"],"misses":[],"reasoning":"judge-gpt-5-mini synthetic fixture vote: pass"},{"name":"judge-claude-haiku","type":"llm-judge","score":0.5,"verdict":"borderline","hits":["judge-claude-haiku judged borderline"],"misses":[],"reasoning":"judge-claude-haiku synthetic fixture vote: borderline"},{"name":"judge-gemini-flash","type":"llm-judge","score":0.0,"verdict":"fail","hits":[],"misses":["judge-gemini-flash judged fail"],"reasoning":"judge-gemini-flash synthetic fixture vote: fail"}]}]} diff --git a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md index c0168d7d..eca901ce 100644 --- a/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md +++ b/examples/showcase/offline-judge-benchmark/prompts/judge-pass-fail-v2.md @@ -1,4 +1,4 @@ -You are one member of a five-model judge panel. +You are one member of a three-model judge panel. Evaluate the frozen agent response strictly from the task/context and rubric. Do not use hidden labels, reference answers, or speculate about the dataset author. 
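The fixture records above all share one nested shape: a record per `test_id`, a single `composite` entry named `judge-panel`, and the per-judge `llm-judge` votes inside it, with the composite score equal to the mean of the votes rounded to four decimals. A minimal TypeScript sketch of that shape and the implied aggregation (the interface names and the `panelMean` helper are illustrative assumptions, not AgentV's real export types or aggregator):

```typescript
// Illustrative types for the fixture records above; field names mirror
// the JSONL keys, but these interfaces are assumptions, not AgentV exports.
interface JudgeScore {
  name: string;
  type: string;
  score: number; // 1.0 = pass, 0.5 = borderline, 0.0 = fail
  verdict: "pass" | "borderline" | "fail";
  scores?: JudgeScore[]; // composite entries nest the per-judge votes
}

interface FixtureRecord {
  test_id: string;
  target: string;
  score: number;
  scores: JudgeScore[];
}

// Mean of the per-judge votes in the composite entry, rounded to
// four decimals the way the fixture scores are written.
function panelMean(record: FixtureRecord): number {
  const panel = record.scores.find((s) => s.type === "composite");
  const votes = panel?.scores ?? [];
  if (votes.length === 0) return 0;
  const mean = votes.reduce((sum, v) => sum + v.score, 0) / votes.length;
  return Math.round(mean * 10_000) / 10_000;
}

// The refund-pass-clear record from setup-a: votes 1.0, 1.0, 0.5.
const record: FixtureRecord = {
  test_id: "refund-pass-clear",
  target: "setup-a",
  score: 0.8333,
  scores: [
    {
      name: "judge-panel",
      type: "composite",
      score: 0.8333,
      verdict: "pass",
      scores: [
        { name: "judge-gpt-5-mini", type: "llm-judge", score: 1.0, verdict: "pass" },
        { name: "judge-claude-haiku", type: "llm-judge", score: 1.0, verdict: "pass" },
        { name: "judge-gemini-flash", type: "llm-judge", score: 0.5, verdict: "borderline" },
      ],
    },
  ],
};

console.log(panelMean(record)); // 0.8333
```

Recomputing the mean this way is a quick consistency check on hand-edited fixtures: if `panelMean` disagrees with a record's stored composite `score`, the fixture was edited inconsistently.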
From 03d2073d84bb1791bfd4dc21ee9d8acde51eeaf3 Mon Sep 17 00:00:00 2001 From: Christopher Tso Date: Sat, 14 Mar 2026 01:37:13 +0000 Subject: [PATCH 3/3] docs: add industry alignment research to offline judge benchmark README --- .../offline-judge-benchmark/README.md | 29 +++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/examples/showcase/offline-judge-benchmark/README.md b/examples/showcase/offline-judge-benchmark/README.md index 8bc12f84..0af536d7 100644 --- a/examples/showcase/offline-judge-benchmark/README.md +++ b/examples/showcase/offline-judge-benchmark/README.md @@ -140,6 +140,35 @@ Because the scored files use one record per `test_id` with a numeric `score`, th - Swap the prompt file to compare judge instructions/policies. - Keep the labeled export constant so the comparison stays paired and fair. +## Industry alignment + +This workflow's design draws from published research and aligns with (or exceeds) peer evaluation frameworks. + +### Multi-model judge panels + +The three-model panel approach is grounded in [Replacing Judges with Juries (PoLL)](https://arxiv.org/abs/2404.18796), which found that an ensemble of 3 smaller models from disjoint families outperforms a single strong judge (GPT-4) in correlation with human judgments while being 7× cheaper. No production framework (DeepEval, Arize Phoenix, LangSmith, RAGAS) ships multi-model panels as a built-in — Braintrust documents "multi-judge voting" as a concept but does not implement it. AgentV composes this from existing primitives (`llm-judge` + `composite`). + +### Scoring judges against human ground truth + +| Framework | Accuracy | Precision / Recall / F1 | Cohen's κ | A/B judge prompts | +|---|---|---|---|---| +| **This workflow** | ✓ | — | — | ✓ (`agentv compare`) | +| Arize Phoenix | ✓ | ✓ | — | Via experiment reruns | +| LangSmith Align | % agreement only | — | — | Baseline vs. 
new prompt | | RAGAS | % accuracy only | — | — | Iterative refinement | | DeepEval | — | — | — | — | | Braintrust | — | — | — | Pairwise ranking | + +Arize Phoenix is the closest peer: it computes all four classification metrics against a golden dataset. The [Judge's Verdict benchmark](https://arxiv.org/html/2510.09738v1) recommends Cohen's kappa over raw accuracy because kappa corrects for chance agreement; it could be added as a follow-up if teams need inter-rater reliability statistics. + +### Portable JSONL fixtures + +Most frameworks store ground-truth datasets in platform-internal formats (DataFrames, platform databases). This workflow uses portable JSONL fixtures with pass/fail labels, which keeps the benchmark CI/CD-friendly and vendor-neutral. + +### Why the scoring script stays outside core + +Per AgentV's [design principles](../../CLAUDE.md) ("Lightweight Core, Plugin Extensibility"), CLI wrappers that consume JSONL output for post-processing belong outside core. The scoring script composes existing primitives and serves a niche use case, consistent with "Built-ins for Primitives Only." + ## Why this stays lightweight This workflow avoids a new benchmark subsystem in core. The reusable pieces are already in AgentV:
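The Cohen's kappa follow-up suggested in the README above is small enough to sketch offline. A hedged TypeScript example, assuming human labels and judge verdicts are binary pass/fail arrays paired by test; the label arrays are hypothetical, and this is illustrative rather than part of the repo's scoring script:

```typescript
// Cohen's kappa for binary pass/fail verdicts: observed agreement
// corrected by the agreement expected from each rater's marginal rates.
// Hypothetical inputs; not part of the AgentV scoring script.
type Verdict = "pass" | "fail";

function cohensKappa(human: Verdict[], judge: Verdict[]): number {
  if (human.length !== judge.length || human.length === 0) {
    throw new Error("need equal-length, non-empty verdict arrays");
  }
  const n = human.length;
  // Observed agreement: fraction of cases where the judge matches the human.
  let agree = 0;
  for (let i = 0; i < n; i++) if (human[i] === judge[i]) agree++;
  const po = agree / n;
  // Chance agreement from each rater's marginal pass rate.
  const pHumanPass = human.filter((v) => v === "pass").length / n;
  const pJudgePass = judge.filter((v) => v === "pass").length / n;
  const pe = pHumanPass * pJudgePass + (1 - pHumanPass) * (1 - pJudgePass);
  if (pe === 1) return 1; // both raters one-sided and identical
  return (po - pe) / (1 - pe);
}

// Hypothetical labels for five test cases: the judge agrees on 4 of 5.
const human: Verdict[] = ["pass", "fail", "pass", "fail", "fail"];
const judge: Verdict[] = ["pass", "pass", "pass", "fail", "fail"];
console.log(cohensKappa(human, judge).toFixed(2)); // prints "0.62"
```

A kappa of 1 is perfect agreement, 0 is chance-level agreement, and negative values are worse than chance, which is why it is a stricter report than the raw accuracy the benchmark currently prints.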