
feat(evaluation): add offline judge benchmark workflow#561

Merged
christso merged 3 commits into main from feat/offline-judge-benchmark
Mar 14, 2026

Conversation

christso (Collaborator) commented Mar 13, 2026

Why

Teams want to use LLM judges to evaluate production agent behavior, but before trusting those judges they need a safe way to measure judge accuracy itself against human-labeled examples. This PR adds the missing benchmark workflow: teams can export sanitized production samples, run a three-model low-cost judge panel offline, compare judge decisions against domain-expert ground truth, and A/B test judge prompts or model mixes before relying on them in broader evaluation programs.

Summary

  • add a public offline LLM-as-judge benchmark workflow with safe labeled export fixtures, replay target glue, and scoring script
  • support per-llm-judge `target:` overrides so one eval can run a three-model judge panel
  • document single-run and A/B judge-setup comparison flows across docs and eval-builder skill references
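A per-judge override in an eval file might look like the sketch below. This is a hypothetical illustration only: the key names (`evaluators`, `type`, `target`) and target identifiers are assumptions based on the PR summary and the three judges named later in this thread, not the verified agentv eval schema.

```yaml
# Hypothetical sketch of a three-model judge panel; key names and target
# identifiers are assumptions, not the confirmed agentv schema.
evaluators:
  - type: llm-judge
    target: openrouter/gpt-5-mini       # per-judge target override
  - type: llm-judge
    target: openrouter/claude-haiku
  - type: llm-judge
    target: openrouter/gemini-flash
```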

Validation

  • bun test packages/core/test/evaluation/evaluators.test.ts packages/core/test/evaluation/loaders/evaluator-parser.test.ts
  • bun test packages/core/test/evaluation/orchestrator.test.ts packages/core/test/evaluation/validation/eval-schema-sync.test.ts
  • bun run build
  • bunx biome check packages/core/src/evaluation/types.ts packages/core/src/evaluation/validation/eval-file.schema.ts packages/core/src/evaluation/loaders/evaluator-parser.ts packages/core/src/evaluation/registry/builtin-evaluators.ts packages/core/test/evaluation/evaluators.test.ts packages/core/test/evaluation/loaders/evaluator-parser.test.ts apps/web/src/content/docs/evaluators/llm-judges.mdx apps/web/src/content/docs/evaluation/examples.mdx examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts plugins/agentv-dev/skills/agentv-eval-builder/references/eval-schema.json
  • bun examples/showcase/offline-judge-benchmark/scripts/replay-fixture-output.ts --prompt $'Task\n<<<AGENT_OUTPUT\nhello world\n>>>AGENT_OUTPUT' --output /tmp/offline-judge-replay.txt
  • bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts --results examples/showcase/offline-judge-benchmark/fixtures/setup-a.raw.jsonl --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl --label judge-setup-a > /tmp/judge-setup-a.scored.jsonl
  • bun examples/showcase/offline-judge-benchmark/scripts/score-judge-benchmark.ts --results examples/showcase/offline-judge-benchmark/fixtures/setup-b.raw.jsonl --dataset examples/showcase/offline-judge-benchmark/fixtures/labeled-judge-export.jsonl --label judge-setup-b > /tmp/judge-setup-b.scored.jsonl
  • bun apps/cli/src/cli.ts compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl (expected regression exit 1 with 0 wins / 2 losses / 3 ties)
  • bun apps/cli/dist/cli.js compare /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl --json (expected regression exit 1)
  • bun examples/features/benchmark-tooling/scripts/benchmark-report.ts /tmp/judge-setup-a.scored.jsonl /tmp/judge-setup-b.scored.jsonl --json
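The score-judge-benchmark step above compares each judge's verdicts in a results JSONL against human labels in the export. A minimal sketch of that agreement computation, assuming hypothetical record shapes (`id`, `verdict`, `label`) rather than the actual agentv JSONL fields:

```typescript
// Hypothetical sketch: fraction of judge verdicts that agree with
// human-labeled ground truth. Field names are assumptions, not the
// real score-judge-benchmark.ts schema.
interface JudgeResult { id: string; verdict: "pass" | "fail"; }
interface LabeledExample { id: string; label: "pass" | "fail"; }

function judgeAccuracy(results: JudgeResult[], dataset: LabeledExample[]): number {
  // Index human labels by example id for O(1) lookup.
  const truth = new Map(dataset.map((d): [string, string] => [d.id, d.label]));
  let agree = 0;
  let total = 0;
  for (const r of results) {
    const label = truth.get(r.id);
    if (label === undefined) continue; // ignore results with no human label
    total += 1;
    if (r.verdict === label) agree += 1;
  }
  return total === 0 ? 0 : agree / total;
}

// Tiny fixture: the judge agrees on 2 of 3 labeled examples.
const results: JudgeResult[] = [
  { id: "a", verdict: "pass" },
  { id: "b", verdict: "fail" },
  { id: "c", verdict: "pass" },
];
const dataset: LabeledExample[] = [
  { id: "a", label: "pass" },
  { id: "b", label: "pass" },
  { id: "c", label: "pass" },
];
console.log(judgeAccuracy(results, dataset).toFixed(2)); // "0.67"
```

Running this per judge setup yields one accuracy number per label (judge-setup-a vs judge-setup-b), which is what the compare step then turns into wins/losses/ties.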

Risk

High — additive public YAML schema change (llm-judge.target) plus new public benchmark workflow example/docs.

Related

  • Orchestration tracked in agentevals-research#1

- add per-llm-judge target overrides for multi-model judge panels
- add a public offline benchmark example with labeled export fixtures and scoring glue
- document single-run and A/B comparison workflows

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

cloudflare-workers-and-pages bot commented Mar 13, 2026

Deploying agentv with Cloudflare Pages

Latest commit: 03d2073
Status: ✅  Deploy successful!
Preview URL: https://6cea2290.agentv.pages.dev
Branch Preview URL: https://feat-offline-judge-benchmark.agentv.pages.dev


Use OpenRouter-backed Claude Haiku and Gemini Flash targets, drop the invalid Gemini Flash Lite example target, and trim the bundled benchmark fixture set to the three judges that are actually configured in the example.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
christso (Collaborator, Author) commented

Nit: apps/web/src/content/docs/evaluation/examples.mdx still says "Benchmark a five-model judge panel" — should be "three-model" to match the actual YAML (3 judges: gpt-5-mini, claude-haiku, gemini-flash). PR body has been corrected.

@christso christso merged commit c99a749 into main Mar 14, 2026
1 check passed
@christso christso deleted the feat/offline-judge-benchmark branch March 14, 2026 01:46