Fix weekly eval gate replay comparison #2302

Open
uncfreak1255-code wants to merge 19 commits into garrytan:master from uncfreak1255-code:codex/weekly-eval-regression-drill-pr

@uncfreak1255-code

Summary

  • Add gbrain eval gate --drill for privacy-safe regression drivers: query hashes, counts, Jaccard, top-1 status, and latency only.
  • Make regression gate replay compare current results at each row's captured result count by default.
  • Document --drill and --compare-limit N|captured in the eval bench guide.
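To illustrate the comparison described in the bullets above, here is a minimal sketch of how one drill row could be computed when the current results are truncated to the row's captured result count. The type names (`CapturedRow`, `DrillRow`, `CompareLimit`), the `drillRow` helper, and the choice of a truncated SHA-256 digest for the query hash are all assumptions made for illustration; they are not taken from the gbrain source.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the real gbrain row types may differ.
interface CapturedRow {
  query: string;
  resultIds: string[]; // result IDs recorded at capture time, in rank order
}

interface DrillRow {
  queryHash: string; // hash only, so the raw query text never leaves the machine
  capturedCount: number;
  currentCount: number;
  jaccard: number;
  top1Match: boolean;
  latencyMs: number;
}

// Mirrors the documented `--compare-limit N|captured` choice.
type CompareLimit = number | "captured";

// Jaccard similarity of two ID lists, treated as sets.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 1; // two empty sets are identical
  let intersection = 0;
  for (const id of setA) if (setB.has(id)) intersection++;
  return intersection / (setA.size + setB.size - intersection);
}

function drillRow(
  captured: CapturedRow,
  currentIds: string[],
  latencyMs: number,
  limit: CompareLimit = "captured",
): DrillRow {
  // "captured" compares at the number of results this row originally had.
  const n = limit === "captured" ? captured.resultIds.length : limit;
  const capturedIds = captured.resultIds.slice(0, n);
  const current = currentIds.slice(0, n);
  return {
    // Assumption: a truncated SHA-256 hex digest stands in for the query hash.
    queryHash: createHash("sha256").update(captured.query).digest("hex").slice(0, 16),
    capturedCount: capturedIds.length,
    currentCount: current.length,
    jaccard: jaccard(capturedIds, current),
    // Two empty lists count as a match because both first elements are undefined.
    top1Match: capturedIds[0] === current[0],
    latencyMs,
  };
}
```

For a row captured with three results `["a", "b", "c"]` and a current run returning `["a", "b", "d", "e", "f"]`, the default `"captured"` limit compares only the first three current results, giving a Jaccard of 0.5. An explicit limit of 5 would compare all five current results and lower the score to 1/3, which is the kind of artificial regression that comparing at the captured count avoids.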

Proof

  • bun test test/eval-replay.test.ts
  • bun test test/eval-gate.test.ts
  • bun run typecheck
  • Live source-checkout receipt: ~/.gbrain/eval-receipts/gbrain-weekly-quality-streak/20260619-gate-drill-clean-pr.json
  • Live result: verdict pass, rows 31/31, mean_jaccard 0.868663594470046, top1 0.8387096774193549, no breaches.
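For readers checking the live result above, the aggregate figures can be reproduced from per-row scores roughly as follows. This is a sketch under assumptions: `RowScore`, `GateSummary`, and `summarize` are hypothetical names, and the real gate may aggregate or apply thresholds differently.

```typescript
// Hypothetical per-row score; only the two fields needed for the aggregates.
interface RowScore {
  jaccard: number;
  top1Match: boolean;
}

interface GateSummary {
  rows: number;
  meanJaccard: number;
  top1Rate: number;
}

// Averages Jaccard across rows and reports the share of rows whose top result matched.
function summarize(rows: RowScore[]): GateSummary {
  if (rows.length === 0) throw new Error("no rows to summarize");
  const jaccardSum = rows.reduce((sum, r) => sum + r.jaccard, 0);
  const top1Hits = rows.filter((r) => r.top1Match).length;
  return {
    rows: rows.length,
    meanJaccard: jaccardSum / rows.length,
    top1Rate: top1Hits / rows.length,
  };
}
```

As a sanity check on the arithmetic, 26 matching rows out of 31 gives 26/31 ≈ 0.8387, which is consistent with the reported top1 value.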

Notes

  • bun run ci:local:diff was not run because gitleaks is not installed on PATH; Docker is available.
  • The prior installed weekly loop still has streak 0; rerun after merge/install before updating the streak.
  • The doctor JSONB progress work is intentionally not included in this PR.

uncfreak1255-code and others added 19 commits June 17, 2026 15:48
…onb-regression-tests

test: cover migrate source JSONB config shape
…-pr23

docs: keep release privacy guidance generic
…e-jsonb-ci-lane

Add migrate source JSONB regression to Postgres E2E CI
…ricing-clean

[codex] Pin DeepSeek V4 pricing and aliases
…epseek-budget-gate

Add monthly Claude and DeepSeek budget gate