feat(evaluation): add offline judge benchmark workflow #561
Merged
Conversation
- add per-llm-judge target overrides for multi-model judge panels
- add a public offline benchmark example with labeled export fixtures and scoring glue
- document single-run and A/B comparison workflows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Deploying agentv with Cloudflare Pages

| | |
| --- | --- |
| Latest commit: | 03d2073 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://6cea2290.agentv.pages.dev |
| Branch Preview URL: | https://feat-offline-judge-benchmark.agentv.pages.dev |
Use OpenRouter-backed Claude Haiku and Gemini Flash targets, drop the invalid Gemini Flash Lite example target, and trim the bundled benchmark fixture set to the three judges that are actually configured in the example.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Why
Teams want to use LLM judges to evaluate production agent behavior, but before trusting those judges they need a safe way to measure judge accuracy itself against human-labeled examples. This PR adds the missing benchmark workflow so teams can export sanitized production samples, run a three-model low-cost judge panel offline, compare judge decisions to domain-expert ground truth, and A/B test judge prompts or model mixes before relying on them in broader evaluation programs.
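The labeled export is the anchor for all of this: each line pairs a sanitized production sample with a domain expert's verdict. As a hypothetical sketch only, the field names below are illustrative assumptions, not the shipped fixture schema:

```jsonl
{"id": "case-001", "prompt": "Summarize the support ticket", "agent_output": "Customer reports a double charge on invoice 4412.", "human_verdict": "pass"}
{"id": "case-002", "prompt": "Draft a refund reply", "agent_output": "Sure, refund issued!", "human_verdict": "fail"}
```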
Summary
- `llm-judge` `target:` overrides so one eval can run a three-model judge panel
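For a sense of the override, here is a minimal sketch of a judge-panel eval. The key names and model identifiers are assumptions for illustration, not the exact shipped schema (see `eval-file.schema.ts` for the real one):

```yaml
# Hypothetical sketch — key names and model ids are assumptions,
# not the shipped schema.
evaluators:
  - name: judge-haiku
    type: llm-judge
    target: openrouter/anthropic/claude-haiku # per-judge override
  - name: judge-flash
    type: llm-judge
    target: openrouter/google/gemini-flash # per-judge override
  - name: judge-default
    type: llm-judge # inherits the eval's default target
```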
Validation

- `bun test packages/core/test/evaluation/evaluators.test.ts packages/core/test/evaluation/loaders/evaluator-parser.test.ts`
- `bun test packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts`
- `bun run build`
- `bunx biome check packages/core/src/evaluation/types.ts packages/core/src/evaluation/validation/eval-file.schema.ts packages/core/src/evaluation/loaders/evaluator-parser.ts packages/core/src/evaluation/registry/builtin-evaluators.ts packages/core/test/evaluation/evaluators.test.ts packages/core/test/evaluation/loaders/evaluator-parser.test.ts apps/web/src/content/docs/evaluators/llm-judges.mdx apps/web/src/content/docs/evaluation/examples.mdx examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json`
- `bun examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts --prompt $'Task\n<<<AGENT_OUTPUT\nhello world\n>>>AGENT_OUTPUT' --output /tmp/offline-judge-replay.txt`
- `bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts --results examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl --label judge-setup-a > /tmp/judge-setup-a.scored.jsonl`
- `bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts --results examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl --label judge-setup-b > /tmp/judge-setup-b.scored.jsonl`
- `bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl` (expected regression exit 1 with 0 wins / 2 losses / 3 ties)
- `bun apps/cli/dist/cli.js compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl --json` (expected regression exit 1)
- `bun examples/features/benchmark-tooling/scripts/benchmark-report.ts /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl --json`
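Conceptually, the scoring step is a join between raw judge results and the labeled dataset, emitting one scored JSONL record per case for `compare` to consume. Below is a minimal sketch of that glue; the field names and file shapes are assumptions, not the actual `score-judge-benchmark.ts`, and positional args stand in for the real script's `--results`/`--dataset`/`--label` flags:

```ts
// Hypothetical scoring glue — field names and file shapes are assumptions.
import { readFileSync } from "node:fs";

interface LabeledCase {
  id: string;
  human_verdict: "pass" | "fail";
}

interface JudgeResult {
  id: string;
  judge_verdict: "pass" | "fail";
}

// Parse a JSONL file into typed records, skipping blank lines.
const readJsonl = <T>(path: string): T[] =>
  readFileSync(path, "utf8")
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as T);

const [resultsPath, datasetPath, label] = process.argv.slice(2);

// Index the human ground-truth verdicts by case id.
const truth = new Map(
  readJsonl<LabeledCase>(datasetPath).map((c) => [c.id, c.human_verdict]),
);

// Score each judge decision against the human label; emit JSONL on stdout.
for (const result of readJsonl<JudgeResult>(resultsPath)) {
  const expected = truth.get(result.id);
  if (expected === undefined) continue; // no human label for this case
  console.log(
    JSON.stringify({
      id: result.id,
      label,
      score: result.judge_verdict === expected ? 1 : 0,
    }),
  );
}
```

Redirecting stdout to a `.scored.jsonl` file, as in the validation commands above, yields one record per labeled case that a pairwise `compare` can tally into wins, losses, and ties.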
Risk

High: additive public YAML schema change (`llm-judge.target`) plus new public benchmark workflow example and docs.

Related