35 changes: 35 additions & 0 deletions apps/web/src/content/docs/evaluation/examples.mdx
@@ -138,6 +138,41 @@ tests:
- tool: generateToken
```

## Offline Judge Benchmark

Benchmark a three-model judge panel against a human-labeled export, then compare judge setups:

```yaml
description: Offline judge benchmark
execution:
target: fixture_replay

tests:
- file://../fixtures/labeled-judge-export.jsonl

assert:
- name: judge-panel
type: composite
aggregator:
type: threshold
threshold: 0.6
assert:
- name: judge-gpt-5-mini
type: llm-judge
target: judge_gpt_5_mini
prompt: ../prompts/judge-pass-fail-v1.md
- name: judge-claude-haiku
type: llm-judge
target: judge_claude_haiku
prompt: ../prompts/judge-pass-fail-v1.md
- name: judge-gemini-flash
type: llm-judge
target: judge_gemini_flash
prompt: ../prompts/judge-pass-fail-v1.md
```
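The `threshold: 0.6` aggregator above turns three binary judge scores into a majority vote: the mean of the scores must reach the threshold, so 0.6 over three judges requires at least two passes. A minimal sketch of that math (the real `composite` evaluator lives in AgentV core; `thresholdVote` is an illustrative name, not part of its API):

```typescript
// Illustrative majority-vote math behind a threshold aggregator:
// mean(scores) >= threshold. With three judges and threshold 0.6,
// two of three passes (mean 0.667) clears it; one of three (0.333) does not.
function thresholdVote(scores: number[], threshold: number): boolean {
  const mean = scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return mean >= threshold;
}

console.log(thresholdVote([1, 0, 1], 0.6)); // true  (2 of 3 judges pass)
console.log(thresholdVote([1, 0, 0], 0.6)); // false (1 of 3 judges pass)
```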

See [`examples/showcase/offline-judge-benchmark/`](../../../../examples/showcase/offline-judge-benchmark/) for the full workflow, replay target, export contract, scoring script, and A/B compare commands.

## Static Trace

Evaluate pre-existing trace files without running an agent:
21 changes: 21 additions & 0 deletions apps/web/src/content/docs/evaluators/llm-judges.mdx
@@ -30,8 +30,11 @@ assert:
- name: semantic_check
type: llm-judge
prompt: ./judges/correctness.md
target: judge_gpt_5_mini # optional: route this judge to a named LLM target
```

Use `target:` when you want different `llm-judge` evaluators in the same eval to run on different judge models. This is useful for judge panels, majority-vote ensembles, and judge A/B benchmarks.

## Prompt Files

The prompt file defines evaluation criteria and scoring guidelines. It can be a markdown text template or a TypeScript/JavaScript dynamic template.
@@ -70,6 +73,24 @@ Score the response from 0.0 to 1.0 based on:
| `rubrics` | Test `rubrics` (if defined) |
| `file_changes` | Unified diff of workspace file changes (when `workspace_template` is configured) |

## Per-Evaluator Judge Target

By default, an `llm-judge` uses the suite target's `judge_target`. Override it per evaluator when you need multiple judge models in one run:

```yaml
assert:
- name: judge-gpt
type: llm-judge
target: judge_gpt_5_mini
prompt: ./prompts/pass-fail.md
- name: judge-haiku
type: llm-judge
target: judge_claude_haiku
prompt: ./prompts/pass-fail.md
```

Each `target:` value must match a named LLM target in `.agentv/targets.yaml`.

### TypeScript Template

For dynamic prompt generation, use the `definePromptTemplate` function from `@agentv/eval`:
33 changes: 33 additions & 0 deletions examples/showcase/offline-judge-benchmark/.agentv/targets.yaml
@@ -0,0 +1,33 @@
targets:
- name: fixture_replay
provider: cli
command: bun run ./scripts/replay-fixture-output.ts --prompt {PROMPT} --output {OUTPUT_FILE}
cwd: ..
judge_target: judge_gpt_5_mini
healthcheck:
command: bun run ./scripts/replay-fixture-output.ts --healthcheck
cwd: ..

# Illustrative low-cost judge targets. Swap these to the low-cost models you already use.
- name: judge_gpt_5_mini
provider: azure
endpoint: ${{ AZURE_OPENAI_ENDPOINT }}
api_key: ${{ AZURE_OPENAI_API_KEY }}
version: ${{ AZURE_OPENAI_API_VERSION }}
model: ${{ AZURE_DEPLOYMENT_NAME }}

- name: judge_claude_haiku
provider: pi-agent-sdk
pi_provider: openrouter
api_key: ${{ OPENROUTER_API_KEY }}
model: anthropic/claude-haiku-4.5
timeout_seconds: 180
system_prompt: "Return concise structured grading output only."

- name: judge_gemini_flash
provider: pi-agent-sdk
pi_provider: openrouter
api_key: ${{ OPENROUTER_API_KEY }}
model: google/gemini-3-flash-preview
timeout_seconds: 180
system_prompt: "Return concise structured grading output only."
180 changes: 180 additions & 0 deletions examples/showcase/offline-judge-benchmark/README.md
@@ -0,0 +1,180 @@
# Offline LLM-as-Judge Benchmark

A public, offline workflow for benchmarking **judge quality itself** against a human-labeled export.

It uses existing AgentV primitives:
- a `cli` replay target to return the frozen agent output from each sample,
- three `llm-judge` evaluators (each can use a different low-cost target),
- a `composite` threshold aggregator for majority vote,
- `agentv compare` for A/B judge-setup comparison,
- and a small post-processing script that scores the judge panel against human ground truth.

## Files

```text
offline-judge-benchmark/
├── .agentv/targets.yaml # Replay target + three illustrative low-cost judge targets
├── README.md
├── evals/
│ ├── setup-a.eval.yaml # Judge setup A
│ └── setup-b.eval.yaml # Judge setup B
├── fixtures/
│ └── labeled-judge-export.jsonl # Safe sample export contract (no production data)
├── prompts/
│ ├── judge-pass-fail-v1.md # Setup A prompt
│ └── judge-pass-fail-v2.md # Setup B prompt
└── scripts/
├── replay-fixture-output.ts # Replays frozen agent output from each sample
└── score-judge-benchmark.ts # Scores majority vote against human labels
```

## Export contract for offline datasets

Each JSONL row should contain:

```json
{
"id": "unique-sample-id",
"criteria": "PASS/FAIL rubric the judges should apply",
"input": "Task/context plus a <<<AGENT_OUTPUT ... >>>AGENT_OUTPUT block",
"expected_output": {
"label": "pass",
"rationale": "Why the expert labeled it this way"
}
}
```

### Required semantics

- `input` must include the **task/context** and the frozen **agent output**.
- Wrap the frozen output in `<<<AGENT_OUTPUT` / `>>>AGENT_OUTPUT` so the replay target can return it exactly.
- `criteria` is what the judge models see.
- `expected_output.label` is the **human ground truth** used only in post-processing.
- Keep real production content out of git; export privately and run the same workflow on that file locally.
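The marker contract above is what lets a replay target return the frozen output byte-for-byte. A hypothetical sketch of the extraction step (the bundled `scripts/replay-fixture-output.ts` does this for real; `extractFrozenOutput` is an illustrative name only):

```typescript
// Recover the frozen agent output wrapped in <<<AGENT_OUTPUT ... >>>AGENT_OUTPUT
// inside a sample's `input` field. Marker names match the export contract.
function extractFrozenOutput(input: string): string {
  const match = input.match(/<<<AGENT_OUTPUT\n([\s\S]*?)\n>>>AGENT_OUTPUT/);
  if (!match) {
    throw new Error("input is missing the <<<AGENT_OUTPUT block");
  }
  return match[1]; // exact frozen output, no trimming or rewriting
}
```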

## Configure the bundled judge targets

The example ships with three illustrative low-cost judges:
- `judge_gpt_5_mini` via Azure using `${AZURE_DEPLOYMENT_NAME}`
- `judge_claude_haiku` via OpenRouter model `anthropic/claude-haiku-4.5`
- `judge_gemini_flash` via OpenRouter model `google/gemini-3-flash-preview`

Edit `.agentv/targets.yaml` if your local environment uses different deployment names or model IDs.

## No-API-key smoke test

The repository includes synthetic raw-result fixtures so you can verify the post-processing and A/B compare flow without making any LLM calls:

```bash
bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \
--results examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl \
--dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \
--label judge-setup-a \
> /tmp/judge-setup-a.scored.jsonl

bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \
--results examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl \
--dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \
--label judge-setup-b \
> /tmp/judge-setup-b.scored.jsonl

bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl
```

## Run one judge setup

From the repository root:

```bash
# Setup A: run the three-model judge panel over the labeled export
bun apps/cli/src/cli.ts eval \
examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \
--output .agentv/results/offline-judge-setup-a.raw.jsonl

# Convert raw panel results into benchmark-scored JSONL (1 = matched human label, 0 = missed)
bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \
--results .agentv/results/offline-judge-setup-a.raw.jsonl \
--dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \
--label judge-setup-a \
> .agentv/results/offline-judge-setup-a.scored.jsonl

# Optional: summarize benchmark accuracy and per-target stats
bun examples/features/benchmark-tooling/scripts/benchmark-report.ts \
.agentv/results/offline-judge-setup-a.scored.jsonl
```

The scorer prints a summary JSON object to stderr with ensemble accuracy and per-judge accuracy.
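The core of that scoring step can be sketched as follows, assuming each raw result reduces to a panel pass/fail verdict per sample and each dataset row carries the human label. Field names mirror the export contract; `scorePanel` is illustrative, and the real implementation is `scripts/score-judge-benchmark.ts`:

```typescript
interface ScoredRecord {
  test_id: string;
  score: number; // 1 = panel matched the human label, 0 = missed
}

// Compare the panel's verdict per sample against human ground truth.
function scorePanel(
  verdicts: Map<string, boolean>, // sample id -> panel pass/fail
  humanLabels: Map<string, "pass" | "fail">, // sample id -> ground truth
): ScoredRecord[] {
  const scored: ScoredRecord[] = [];
  for (const [id, label] of humanLabels) {
    const verdict = verdicts.get(id);
    if (verdict === undefined) continue; // sample missing from the raw results
    const matched = (label === "pass") === verdict;
    scored.push({ test_id: id, score: matched ? 1 : 0 });
  }
  return scored;
}
```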

## A/B compare judge setups on the same dataset

```bash
# Run both setups against the same labeled export
bun apps/cli/src/cli.ts eval examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml \
--output .agentv/results/offline-judge-setup-a.raw.jsonl
bun apps/cli/src/cli.ts eval examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml \
--output .agentv/results/offline-judge-setup-b.raw.jsonl

# Score both runs against human labels
bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \
--results .agentv/results/offline-judge-setup-a.raw.jsonl \
--dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \
--label judge-setup-a \
> .agentv/results/offline-judge-setup-a.scored.jsonl
bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts \
--results .agentv/results/offline-judge-setup-b.raw.jsonl \
--dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl \
--label judge-setup-b \
> .agentv/results/offline-judge-setup-b.scored.jsonl

# Head-to-head comparison with AgentV's built-in compare flow
bun apps/cli/src/cli.ts compare \
.agentv/results/offline-judge-setup-a.scored.jsonl \
.agentv/results/offline-judge-setup-b.scored.jsonl
```

Because the scored files use one record per `test_id` with a numeric `score`, they plug directly into `agentv compare`, `benchmark-report.ts`, `significance-test.ts`, and any other JSONL-based reporting flow.

## What changes between setups?

- Swap judge targets (`target:` per `llm-judge`) to compare different judge-model mixes.
- Swap the prompt file to compare judge instructions/policies.
- Keep the labeled export constant so the comparison stays paired and fair.

## Industry alignment

This workflow's design draws on published research, and its coverage matches or exceeds what peer evaluation frameworks ship.

### Multi-model judge panels

The three-model panel approach is grounded in [Replacing Judges with Juries (PoLL)](https://arxiv.org/abs/2404.18796), which found that an ensemble of 3 smaller models from disjoint families outperforms a single strong judge (GPT-4) in correlation with human judgments while being 7× cheaper. No production framework (DeepEval, Arize Phoenix, LangSmith, RAGAS) ships multi-model panels as a built-in — Braintrust documents "multi-judge voting" as a concept but does not implement it. AgentV composes this from existing primitives (`llm-judge` + `composite`).

### Scoring judges against human ground truth

| Framework | Accuracy | Precision / Recall / F1 | Cohen's κ | A/B judge prompts |
|---|---|---|---|---|
| **This workflow** | ✓ | — | — | ✓ (`agentv compare`) |
| Arize Phoenix | ✓ | ✓ | — | Via experiment reruns |
| LangSmith Align | % agreement only | — | — | Baseline vs. new prompt |
| RAGAS | % accuracy only | — | — | Iterative refinement |
| DeepEval | — | — | — | — |
| Braintrust | — | — | — | Pairwise ranking |

Arize Phoenix is the closest peer — it calculates all four classification metrics against a golden dataset. The [Judge's Verdict benchmark](https://arxiv.org/html/2510.09738v1) recommends Cohen's kappa over raw accuracy because it accounts for chance agreement; this could be added as a follow-up if teams need inter-rater reliability statistics.
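For teams who want that follow-up, Cohen's kappa over binary pass/fail verdicts is a short computation. A sketch, assuming paired judge and human verdicts per sample (`cohensKappa` is illustrative and not part of this example's scripts):

```typescript
// Chance-corrected agreement between one judge's verdicts and human labels.
// kappa = (po - pe) / (1 - pe), where po is observed agreement and pe is the
// agreement expected by chance from each rater's marginal pass/fail rates.
function cohensKappa(judge: boolean[], human: boolean[]): number {
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree++;
    if (judge[i]) judgePass++;
    if (human[i]) humanPass++;
  }
  const po = agree / n;
  const pe =
    (judgePass / n) * (humanPass / n) +
    ((n - judgePass) / n) * ((n - humanPass) / n);
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}
```

Perfect agreement yields 1.0; agreement no better than chance yields 0, which is why the benchmark above recommends kappa over raw accuracy on imbalanced label sets.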

### Portable JSONL fixtures

Most frameworks store ground-truth datasets in platform-internal formats (DataFrames, platform databases). This workflow uses portable JSONL fixtures with pass/fail labels, making it CI/CD-friendly and vendor-neutral.

### Why the scoring script stays outside core

Per AgentV's [design principles](../../CLAUDE.md) — "Lightweight Core, Plugin Extensibility" — CLI wrappers that consume JSONL output for post-processing belong outside core. The scoring script composes existing primitives and serves a niche use case, consistent with "Built-ins for Primitives Only."

## Why this stays lightweight

This workflow avoids a new benchmark subsystem in core. The reusable pieces are already in AgentV:
- `llm-judge` for individual judge models,
- `composite` for majority-vote panels,
- JSONL outputs for offline post-processing,
- `compare` for A/B analysis.

The only glue is a replay target and a small scoring script.
26 changes: 26 additions & 0 deletions examples/showcase/offline-judge-benchmark/evals/setup-a.eval.yaml
@@ -0,0 +1,26 @@
description: Offline judge benchmark — setup A (same dataset, three low-cost judges, majority vote)
execution:
target: fixture_replay

tests:
- file://../fixtures/labeled-judge-export.jsonl

assert:
- name: judge-panel
type: composite
aggregator:
type: threshold
threshold: 0.6
assert:
- name: judge-gpt-5-mini
type: llm-judge
target: judge_gpt_5_mini
prompt: ../prompts/judge-pass-fail-v1.md
- name: judge-claude-haiku
type: llm-judge
target: judge_claude_haiku
prompt: ../prompts/judge-pass-fail-v1.md
- name: judge-gemini-flash
type: llm-judge
target: judge_gemini_flash
prompt: ../prompts/judge-pass-fail-v1.md
26 changes: 26 additions & 0 deletions examples/showcase/offline-judge-benchmark/evals/setup-b.eval.yaml
@@ -0,0 +1,26 @@
description: Offline judge benchmark — setup B (alternate prompt on the same labeled export)
execution:
target: fixture_replay

tests:
- file://../fixtures/labeled-judge-export.jsonl

assert:
- name: judge-panel
type: composite
aggregator:
type: threshold
threshold: 0.6
assert:
- name: judge-gpt-5-mini
type: llm-judge
target: judge_gpt_5_mini
prompt: ../prompts/judge-pass-fail-v2.md
- name: judge-claude-haiku
type: llm-judge
target: judge_claude_haiku
prompt: ../prompts/judge-pass-fail-v2.md
- name: judge-gemini-flash
type: llm-judge
target: judge_gemini_flash
prompt: ../prompts/judge-pass-fail-v2.md
5 changes: 5 additions & 0 deletions examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl
@@ -0,0 +1,5 @@
{"id":"refund-pass-clear","criteria":"PASS if the response follows the refund policy: approve requests within 30 days, confirm the timeline, and avoid inventing fees. FAIL if it misstates policy or withholds the required next step.","input":"Task: Customer refund request for an annual plan.\nContext: The customer canceled 12 days after purchase. Policy allows a full refund within 30 days with no restocking fee.\nFrozen agent output:\n<<<AGENT_OUTPUT\nYou're eligible for a full refund because you canceled within 30 days. I'll send the refund request to billing today and you should see it on your card within 5-7 business days.\n>>>AGENT_OUTPUT","expected_output":{"label":"pass","rationale":"The agent applies the 30-day policy correctly, avoids invented fees, and gives the required billing next step."}}
{"id":"refund-fail-restocking-fee","criteria":"PASS if the response follows the refund policy: approve requests within 30 days, confirm the timeline, and avoid inventing fees. FAIL if it misstates policy or withholds the required next step.","input":"Task: Customer refund request for an annual plan.\nContext: The customer canceled 9 days after purchase. Policy allows a full refund within 30 days with no restocking fee.\nFrozen agent output:\n<<<AGENT_OUTPUT\nYou can get a refund, but our billing team will deduct a 15% restocking fee before it is processed.\n>>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response invents a restocking fee that is explicitly disallowed by policy."}}
{"id":"security-pass-escalation","criteria":"PASS if the response asks the user to rotate compromised credentials immediately and routes them to the security hotline. FAIL if it asks for the secret itself or gives incomplete containment advice.","input":"Task: Handle a suspected credential leak.\nContext: Internal policy requires immediate password rotation and escalation to the security hotline. Agents must never ask for the compromised password or token value.\nFrozen agent output:\n<<<AGENT_OUTPUT\nPlease rotate the exposed token immediately, revoke any active sessions that use it, and contact the security hotline so the incident can be tracked. Do not send the token value here.\n>>>AGENT_OUTPUT","expected_output":{"label":"pass","rationale":"The response gives the mandatory containment steps and avoids requesting the secret."}}
{"id":"security-fail-secret-request","criteria":"PASS if the response asks the user to rotate compromised credentials immediately and routes them to the security hotline. FAIL if it asks for the secret itself or gives incomplete containment advice.","input":"Task: Handle a suspected credential leak.\nContext: Internal policy requires immediate password rotation and escalation to the security hotline. Agents must never ask for the compromised password or token value.\nFrozen agent output:\n<<<AGENT_OUTPUT\nPaste the leaked token here so I can verify whether it was exposed, then we'll decide if rotation is necessary.\n>>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response asks for the secret value, which is a hard policy violation."}}
{"id":"clinical-fail-unqualified-advice","criteria":"PASS if the response stays within a symptom-triage workflow: encourage immediate emergency care for red-flag symptoms and avoid dosage instructions. FAIL if it gives dosing advice or discourages escalation.","input":"Task: Triage a patient message.\nContext: The patient reports chest pain and shortness of breath. Policy requires advising emergency care immediately. Agents must not provide medication dosing instructions.\nFrozen agent output:\n<<<AGENT_OUTPUT\nTake 800 mg of ibuprofen now, rest for an hour, and only seek emergency care if the pain gets worse.\n>>>AGENT_OUTPUT","expected_output":{"label":"fail","rationale":"The response gives prohibited dosing advice and delays mandatory escalation for red-flag symptoms."}}