
ci(eval): use --repeats 3 on gpt-oss-120b gpqa-diamond to suppress noise #278

Merged
qywu merged 3 commits into lightseekorg:main from qywu:ci/gpqa-diamond-repeats on May 27, 2026


@qywu qywu commented May 27, 2026

Summary

A recent flake on PR #272 (CI run 26483245060) failed the eval-gpt-oss-120b-mxfp4-gpqa-diamond check at score=0.6919 vs threshold>=0.7, about 2 questions short out of 198. The PR's diff was HTTP plumbing only (no model / kernel / sampling code), so the failure is sampling noise, not a regression.

GPQA Diamond runs all 198 questions once (AveragePass@1, no --limit, no --repeats). With a ~70% pass rate this gives a binomial stddev of:

σ ≈ sqrt(198 × 0.7 × 0.3) / 198 ≈ 0.033

matching the YAML's documented 0.74 ± 0.03 band. The 0.7 threshold sits at ~1.3σ below that mean, so a noise-driven false-fail happens on roughly 1 in 9 clean runs.
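The arithmetic above can be reproduced with a short sketch. It assumes a normal approximation to the binomial and, like the formula above, uses a 0.7 pass rate for the spread and 0.74 as the mean; the function names are illustrative and not part of evalscope.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def score_stddev(n_samples: int, pass_rate: float) -> float:
    """Stddev of the mean of n independent Bernoulli(pass_rate) outcomes."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_samples)


n_questions = 198
documented_mean = 0.74
threshold = 0.70

sigma = score_stddev(n_questions, 0.7)        # roughly 0.033
z = (documented_mean - threshold) / sigma     # threshold distance in sigmas
false_fail = normal_cdf(-z)                   # P(clean run scores below threshold)

print(f"sigma={sigma:.4f}  z={z:.2f}  false-fail={false_fail:.3f}")
```

The exact z value depends on whether σ is taken as 0.03 or 0.033, which is why the figures in this description are quoted as approximate.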

Fix

Add --repeats 3 to the evalscope command. evalscope samples each question 3× and averages the per-question scores before aggregating, which shrinks the per-run stddev to ~1/√3 of its single-sample value (assuming the repeats are independent samples).

| | before | after |
|---|---|---|
| samples per question | 1 | 3 |
| total generations | 198 | 594 |
| score stddev | ~0.033 | ~0.019 |
| threshold position (σ below mean) | ~1.3 | ~2.1 |
| noise-only false-fail rate | ~11% | ~2% |
| eval wall-clock (b200-2gpu) | ~2 min | ~6 min |
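The stddev reduction can be sanity-checked with a small Monte Carlo sketch. It is idealized: every question is assumed to have the same 0.74 pass probability and repeats are assumed independent, which real GPQA Diamond questions will not satisfy exactly. `run_eval` and `simulated_stddev` are hypothetical helpers, not evalscope functions.

```python
import random
import statistics


def run_eval(rng: random.Random, n_questions: int, repeats: int, pass_rate: float) -> float:
    """One simulated eval: average each question's repeats, then average over questions."""
    per_question = []
    for _ in range(n_questions):
        hits = sum(rng.random() < pass_rate for _ in range(repeats))
        per_question.append(hits / repeats)
    return sum(per_question) / n_questions


def simulated_stddev(repeats: int, n_runs: int = 2000, seed: int = 0) -> float:
    """Stddev of the run score across many simulated clean runs."""
    rng = random.Random(seed)
    scores = [run_eval(rng, 198, repeats, 0.74) for _ in range(n_runs)]
    return statistics.stdev(scores)


sd1 = simulated_stddev(repeats=1)
sd3 = simulated_stddev(repeats=3)
print(f"repeats=1: {sd1:.4f}  repeats=3: {sd3:.4f}  ratio: {sd3 / sd1:.2f}")
```

Under these assumptions the ratio should land near 1/√3 ≈ 0.58.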

The 0.7 threshold is unchanged so a real accuracy regression (a drop of more than ~2σ from the documented mean) is still caught.
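To illustrate how the unchanged threshold behaves with 3 repeats, the sketch below estimates the probability that a run fails for a few hypothetical true accuracies. It assumes a normal approximation, independent samples, and the same fixed 0.7 pass rate for the spread that the stddev figures above use; the accuracies other than 0.74 are made-up inputs, not measurements.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def fail_probability(true_accuracy: float, threshold: float = 0.70,
                     n_questions: int = 198, repeats: int = 3) -> float:
    """P(run score < threshold) for a model with the given true accuracy."""
    # Spread computed at a 0.7 pass rate, matching the convention used above.
    sigma = math.sqrt(0.7 * 0.3 / (n_questions * repeats))
    return normal_cdf((threshold - true_accuracy) / sigma)


for acc in (0.74, 0.70, 0.66):
    print(f"true accuracy {acc:.2f}: P(fail) = {fail_probability(acc):.3f}")
```

A model sitting exactly at the threshold fails half the time, while one that has dropped about 2σ below it fails almost every run, which is the sense in which a real regression is still caught.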

Only gpt-oss-120b-mxfp4-evalscope-gpqa-diamond.yaml is touched. The kimi-k2.5 gpqa-diamond eval is nightly-only without a score_threshold — it can't fail the same way and doesn't need the cost.

Test plan

  • Re-run the same CI commit and confirm the eval passes after the change
  • Confirm total per-commit CI time increases by ~4 GPU-min (acceptable on b200-2gpu)

qywu added 3 commits May 27, 2026 03:37
This eval recently flaked at score=0.6919 vs threshold>=0.7 (~2 questions
short on 198 total). The 0.7 bar sits at ~1.3σ below the documented
0.74 ± 0.03 mean, so ~11% of clean runs land below threshold from
binomial sampling noise alone (see PR lightseekorg#272 CI run 26483245060).

evalscope's --repeats N runs each prompt N times and averages the scores
before aggregating. With --repeats 3 the variance of the run score
shrinks to ~1/3 (stddev by ~1/sqrt(3)), cutting the score stddev from
~0.033 to ~0.019. At the unchanged 0.7 threshold this moves the
false-fail rate from ~11% down to ~2%, at a cost of roughly 4 extra
GPU-min on each per-commit run.

Threshold left at 0.7 so a real accuracy regression (drop of more than
~2σ from the mean) is still caught.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>

At the prior --eval-batch-size 16 the server pinned at 13-16 in-flight
requests with page_ratio=0.00-0.01, leaving most decode capacity idle
(server max_num_seqs=160). Lifting evalscope client concurrency to 64
roughly cancels the cost of the new --repeats 3 by raising throughput,
and lets the prefix cache hit on the repeat duplicates for additional
TTFT savings.

Eval scoring is independent of batch size, so the noise-reduction
benefit of --repeats 3 is preserved.

Signed-off-by: Qingyang Wu <willqywu@gmail.com>
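A rough back-of-the-envelope check of the "roughly cancels" claim, as a sketch: it counts how many full batches ("waves") of requests the client has to issue before and after. It assumes each wave takes about the same time at concurrency 64 as at 16, which is plausible only if the server really has idle decode capacity; that assumption is not measured here, and `request_waves` is a hypothetical helper.

```python
import math


def request_waves(total_requests: int, concurrency: int) -> int:
    """Number of sequential batches needed to issue all requests."""
    return math.ceil(total_requests / concurrency)


# Before: 198 questions x 1 sample at client concurrency 16.
before = request_waves(198 * 1, 16)
# After: 198 questions x 3 samples at client concurrency 64.
after = request_waves(198 * 3, 64)

print(f"waves before: {before}, waves after: {after}")
```

If per-wave latency grows noticeably at the higher concurrency, the wall-clock saving shrinks accordingly.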
@lightseek-bot lightseek-bot left a comment


decent

@qywu qywu merged commit f475bca into lightseekorg:main May 27, 2026
16 of 30 checks passed