ci(eval): use --repeats 3 on gpt-oss-120b gpqa-diamond to suppress noise by qywu · Pull Request #278 · lightseekorg/tokenspeed

qywu · 2026-05-27T03:38:15Z

Summary

Recent flake on PR #272 (CI run 26483245060) failed the eval-gpt-oss-120b-mxfp4-gpqa-diamond check at score=0.6919 vs threshold>=0.7 — off by ~2 questions out of 198. The PR's diff was HTTP plumbing only (no model / kernel / sampling code), so the failure is sampling noise, not a regression.

GPQA Diamond runs all 198 questions once (AveragePass@1, no --limit, no --repeats). With a ~70% pass rate this gives a binomial stddev of:

σ ≈ sqrt(198 × 0.7 × 0.3) / 198 ≈ 0.033

matching the YAML's documented 0.74 ± 0.03 band. The 0.7 threshold sits at ~1.3σ below that mean, so a noise-driven false-fail happens on roughly 1 in 9 clean runs.
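The arithmetic above can be reproduced with a short sketch. It assumes an idealized model that the PR itself does not spell out: every question is an independent Bernoulli trial with the same pass rate, and the run score is approximately normal around the documented 0.74 mean. The function names are illustrative only and are not part of evalscope.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def false_fail_rate(n_questions: int, pass_rate: float,
                    mean: float, threshold: float):
    """Return (sigma, z, probability of a clean run scoring below threshold).

    Assumes i.i.d. Bernoulli questions and a normal approximation.
    """
    sigma = math.sqrt(n_questions * pass_rate * (1.0 - pass_rate)) / n_questions
    z = (mean - threshold) / sigma
    return sigma, z, normal_cdf(-z)


sigma, z, rate = false_fail_rate(198, 0.7, 0.74, 0.70)
print(f"sigma={sigma:.3f}  threshold is {z:.2f} sigma below mean  "
      f"false-fail rate={rate:.1%}")
```

Under these assumptions the false-fail rate comes out near the "1 in 9" quoted above. The exact σ distance depends on whether the rounded 0.03 or the unrounded stddev is used.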

Fix

Add --repeats 3 to the evalscope command. evalscope samples each question 3× and averages the per-question scores before aggregating, shrinking the per-run stddev by a factor of ~√3 (to ~1/√3 of its single-sample value).

	before	after
samples per question	1	3
total generations	198	594
score stddev	~0.033	~0.019
threshold position (σ below mean)	~1.3	~2.1
noise-only false-fail rate	~11%	~2%
eval wall-clock (b200-2gpu)	~2 min	~6 min

The 0.7 threshold is unchanged so a real accuracy regression (a drop of more than ~2σ from the documented mean) is still caught.
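As a sanity check on the before/after stddev figures, here is a minimal Monte Carlo sketch. It does not call evalscope. It assumes each generation is an independent pass/fail draw with a uniform 0.7 pass probability, which is a simplification because real per-question difficulty varies. The helper name and defaults are hypothetical.

```python
import random
import statistics


def simulate_score_std(n_questions: int = 198, pass_rate: float = 0.7,
                       repeats: int = 1, trials: int = 1000,
                       seed: int = 0) -> float:
    """Estimate the run-to-run stddev of the averaged score by simulation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    scores = []
    for _ in range(trials):
        total = 0.0
        for _ in range(n_questions):
            # Sample the question `repeats` times and average its score.
            passes = sum(rng.random() < pass_rate for _ in range(repeats))
            total += passes / repeats
        scores.append(total / n_questions)
    return statistics.pstdev(scores)


std_1 = simulate_score_std(repeats=1)
std_3 = simulate_score_std(repeats=3)
print(f"repeats=1: {std_1:.4f}  repeats=3: {std_3:.4f}  "
      f"ratio={std_1 / std_3:.2f}")
```

The ratio should land close to √3 ≈ 1.73, consistent with the ~0.033 to ~0.019 row in the table.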

Only gpt-oss-120b-mxfp4-evalscope-gpqa-diamond.yaml is touched. The kimi-k2.5 gpqa-diamond eval is nightly-only without a score_threshold — it can't fail the same way and doesn't need the cost.

Test plan

Re-run the same CI commit and confirm the eval passes after the change
Confirm total per-commit CI time increases by ~4 GPU-min (acceptable on b200-2gpu)

This eval recently flaked at score=0.6919 vs threshold>=0.7 (~2 questions short on 198 total). The 0.7 bar sits at ~1.3σ below the documented 0.74 ± 0.03 mean, so ~11% of clean runs land below threshold from binomial sampling noise alone (see PR lightseekorg#272 CI run 26483245060).

evalscope's --repeats N runs each prompt N times and averages the scores before aggregating. With --repeats 3 the per-question variance shrinks to ~1/3, cutting the score stddev by a factor of ~sqrt(3), from ~0.033 to ~0.019. At the unchanged 0.7 threshold this moves the false-fail rate from ~11% down to ~2%, at a cost of roughly 4 extra GPU-min on each per-commit run.

The threshold is left at 0.7 so a real accuracy regression (a drop of more than ~2σ from the mean) is still caught.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

At the prior --eval-batch-size 16 the server sat at 13-16 in-flight requests with page_ratio=0.00-0.01, leaving most decode capacity idle (server max_num_seqs=160). Lifting evalscope client concurrency to 64 raises throughput enough to roughly cancel the cost of the new --repeats 3, and lets the prefix cache hit on the repeated prompts for additional TTFT savings.

Eval scoring is independent of batch size, so the noise-reduction benefit of --repeats 3 is preserved.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
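A back-of-envelope sketch of why higher client concurrency can offset the tripled generation count. It assumes requests complete in fixed-size "waves" and that per-wave latency does not grow between 16 and 64 in-flight requests. That assumption is only plausible while the server has spare decode capacity, as the commit message reports, and real latency will likely rise somewhat. The function is hypothetical and does not model evalscope's actual scheduler.

```python
import math


def request_waves(total_generations: int, concurrency: int) -> int:
    """Number of sequential batches needed at a given client concurrency."""
    return math.ceil(total_generations / concurrency)


# 198 questions, 1 sample each, 16 in flight (before this PR)
waves_before = request_waves(198 * 1, 16)
# 198 questions, 3 samples each, 64 in flight (after this PR)
waves_after = request_waves(198 * 3, 64)

print(f"before: {waves_before} waves, after: {waves_after} waves")
```

Under this simplified model the post-change run needs no more waves than the original, which is in line with the "roughly cancels" claim.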

lightseek-bot

decent

lightseek-bot · 2026-05-27T04:08:44Z

https://github.com/lightseekorg/tokenspeed/actions/runs/26489368179/job/78003904536?pr=278
https://github.com/lightseekorg/tokenspeed/actions/runs/26489368179/job/78003904573?pr=278

qywu added 3 commits May 27, 2026 03:37

Merge branch 'main' into ci/gpqa-diamond-repeats

2136c1c

syuoni approved these changes May 27, 2026

lightseek-bot approved these changes May 27, 2026

qywu merged commit f475bca into lightseekorg:main May 27, 2026
16 of 30 checks passed

qywu merged 3 commits into lightseekorg:main from qywu:ci/gpqa-diamond-repeats

3 participants
