Bench repeat trials #29

Merged: swelljoe merged 8 commits into main from bench-repeat-trials on Jun 1, 2026

swelljoe (Owner) commented Jun 1, 2026

Add a --repeat option so the same prompt and configuration can be run multiple times across all models. Results are noisy from run to run, so replicating them is valuable and provides useful data.

Also fixes the fuzzy model specifier for DeepSeek: it resolved to deepseek-v4-pro almost exclusively yesterday but, for some reason, only to deepseek-v4-flash today. We want to know which model we're using at all times, because the two perform quite differently, as expected.

Joe Cooper and others added 4 commits May 31, 2026 14:49
Single-run detection is noisy: the same hosted model can score 4/9 one
pass and 2/9 the next (empty findings on marginal cases), so n=1 per cell
cannot rank cheap models. Add repeated independent trials per cell plus a
per-trial variance report.

- schema v6: runs.trial (0-based repeat index), per-(case,competitor,file,
  trial) index; non-destructive ADD COLUMN, legacy runs default to trial 0.
- plan_matrix(repeat=N): plans N trials per cell, resumes per-trial so a
  partial pass fills exactly the missing trials (and bumping N later only
  adds the new trials).
- thread trial through Runner.run_case / create_run / RunScore.
- score.noise_report(): rolls file-runs up to a (competitor, case, trial)
  outcome, then reports each trial's detection rate, spread (max-min), and
  per-case hit-frequency, flagging flaky cases (hit some trials, missed
  others). Single-trial data degrades to spread 0.
- `bench loop --repeat N` and a read-only `bench noise` report command.
- .gitignore: experimental datasets (nelson-*.db) + bench-exp*.html.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
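
The variance report described in this commit can be illustrated with a minimal sketch. It is not the code in nelson/score.py: the function name `noise_summary` and the input shape (one already rolled-up `(competitor, case, trial, hit)` tuple per cell) are assumptions made for illustration.

```python
from collections import defaultdict


def noise_summary(outcomes):
    """Summarize per-trial variance.

    `outcomes` is an iterable of (competitor, case, trial, hit) tuples,
    one per rolled-up cell. This shape is hypothetical.
    """
    per_trial = defaultdict(lambda: defaultdict(list))  # competitor -> trial -> hits
    per_case = defaultdict(lambda: defaultdict(list))   # competitor -> case -> hits
    for competitor, case, trial, hit in outcomes:
        per_trial[competitor][trial].append(bool(hit))
        per_case[competitor][case].append(bool(hit))

    report = {}
    for competitor, trials in per_trial.items():
        # Detection rate of each trial: fraction of cases hit in that trial.
        rates = {t: sum(h) / len(h) for t, h in sorted(trials.items())}
        # Hit frequency of each case: fraction of trials that hit it.
        hit_freq = {c: sum(h) / len(h) for c, h in per_case[competitor].items()}
        report[competitor] = {
            "trial_rates": rates,
            "spread": max(rates.values()) - min(rates.values()),
            "hit_frequency": hit_freq,
            # Flaky: hit in some trials, missed in others.
            "flaky_cases": sorted(c for c, f in hit_freq.items() if 0 < f < 1),
        }
    return report
```

With a single trial per competitor, `max` and `min` coincide, so the spread degrades to 0 as the commit message states.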
…waps models)

`deepseek-chat` is a server-side alias DeepSeek routes to deepseek-v4-pro OR
deepseek-v4-flash at its own discretion — observed pro on 2026-05-30 and flash
on 2026-05-31 with Nelson the only key user. That silently swaps the model
under a fixed label and confounds results (part of the deepseek detection drop
between datasets is plausibly a pro->flash swap, not sampling noise).

/models exposes exactly two real ids: deepseek-v4-pro, deepseek-v4-flash. Pin
both as separate competitors (raw-api-loop/deepseek-pro, -flash) so the tier is
explicit and the two can be compared directly. Pricing left VERIFY-AT-WIRING.

Leaderboard labels: deepseek-pro -> deepseek-v4-pro, deepseek-flash ->
deepseek-v4-flash; the legacy unpinned `raw-api-loop/deepseek` rows are
relabelled "deepseek-v4 (alias)" to flag they are a tier mix, not clean pro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
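
To make the alias problem concrete, here is a small sketch of how a silently swapped model could be detected from logged responses. It assumes each run records both the requested model label and the model id the server reports back; that logging, and the function name `find_drifting_labels`, are hypothetical and not part of this PR.

```python
from collections import defaultdict


def find_drifting_labels(observations):
    """Return requested labels that were served by more than one model.

    `observations` is an iterable of (requested_label, served_model_id)
    pairs. This log format is an assumption for illustration.
    """
    served = defaultdict(set)
    for requested, served_model in observations:
        served[requested].add(served_model)
    return {
        label: sorted(models)
        for label, models in served.items()
        if len(models) > 1
    }


observations = [
    ("deepseek-chat", "deepseek-v4-pro"),    # alias routed to pro
    ("deepseek-chat", "deepseek-v4-flash"),  # same alias routed to flash
    ("deepseek-v4-pro", "deepseek-v4-pro"),  # pinned id stays stable
]
drift = find_drifting_labels(observations)
```

Pinning the two real ids as separate competitors, as this commit does, removes the need for such a check because the label and the served model can no longer diverge.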
Real DeepSeek flash rates (verified 2026-05-31 portal): input is 0.14 on a
cache miss but 0.0028 on a cache hit (50x cheaper), output 0.28. The cost_model
input rate is the miss/list rate; input_cache_hit_usd_per_mtok is recorded but
not yet consumed (compute_cost is cache-blind). Because DeepSeek caches
aggressively and the ReAct loop resends the whole context each turn, recorded
DeepSeek cost overcounts the real bill heavily until cache-aware costing lands.
Pro pricing still VERIFY-AT-WIRING.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
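
The size of the overcount can be sketched with the flash rates quoted above. This is an illustration only: the function `cost_usd`, its parameters, and the token counts are assumptions, and it is not the project's compute_cost.

```python
def cost_usd(input_tokens, output_tokens, cached_input_tokens=0, *,
             input_rate=0.14, cache_hit_rate=0.0028, output_rate=0.28):
    """Cost in USD; all rates are USD per million tokens.

    Defaults are the DeepSeek flash rates quoted in the commit message.
    With cached_input_tokens=0 this is the cache-blind calculation.
    """
    missed = input_tokens - cached_input_tokens
    total = (
        missed * input_rate
        + cached_input_tokens * cache_hit_rate
        + output_tokens * output_rate
    )
    return total / 1_000_000


# Hypothetical run: 1M input tokens, 90% of them cache hits, 10k output tokens.
blind = cost_usd(1_000_000, 10_000)                               # about 0.1428
aware = cost_usd(1_000_000, 10_000, cached_input_tokens=900_000)  # about 0.0193
```

In this made-up scenario the cache-blind figure is roughly seven times the cache-aware one, which is the kind of gap the commit message warns about when a ReAct loop resends the whole context each turn.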
…0.87 (promo)

Promotional pro rates (verified 2026-05-31 portal; use until the promo ends).
3x flash's output and ~3x its miss-input — the prior 0.14/0.28 placeholder
under-priced pro. Matches the existing claude-code/deepseek (0.435/0.87) entry.
input rate is the cache-miss/list rate; input_cache_hit_usd_per_mtok recorded
for cache-aware costing (compute_cost is still cache-blind, so recorded cost
overcounts until that lands).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds support for repeated benchmark trials (--repeat) so run-to-run variance can be measured and reported per competitor, and pins DeepSeek competitors to explicit tiers to avoid silent model routing changes.

Changes:

  • Add trial (0-based repeat index) to persisted runs and propagate it through planning/execution/scoring.
  • Introduce a bench noise report (and scoring helper) to summarize per-trial detection variance and flaky cases.
  • Update DeepSeek competitor naming/mapping to distinguish pinned pro vs flash, and adjust examples/ignore patterns.
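
The per-trial planning and resume behaviour can be sketched as follows. This is a simplified illustration, not the code in nelson/automate.py: the function name `plan_cells`, its arguments, and the tuple layout are assumptions.

```python
from itertools import product


def plan_cells(competitors, targets, repeat, existing):
    """Return the (competitor, case, file, trial) cells still to be run.

    `targets` is a list of (case, file) pairs and `existing` holds the
    cells already recorded. Both shapes are hypothetical.
    """
    done = set(existing)
    planned = []
    for competitor, (case, file), trial in product(competitors, targets, range(repeat)):
        cell = (competitor, case, file, trial)
        if cell not in done:  # resume: plan only the missing trials
            planned.append(cell)
    return planned


# Trials 0 and 2 already ran, so a --repeat 3 pass plans only trial 1.
existing = {("m", "c1", "a.py", 0), ("m", "c1", "a.py", 2)}
missing = plan_cells(["m"], [("c1", "a.py")], repeat=3, existing=existing)
```

Because trials are 0-based and indexed individually, raising the repeat count later plans only the new trial indices and leaves completed cells untouched.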

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Summary per file

| File | Description |
| --- | --- |
| tests/test_score.py | Adds unit tests for the new noise_report behavior (per-trial rollup, flaky detection, single-trial spread). |
| tests/test_automate.py | Extends planning/resume tests to cover repeat trials and adds trial to existing-run indexing fixtures. |
| nelson/score.py | Adds RunScore.trial and implements noise_report/CompetitorNoise for variance reporting. |
| nelson/runner.py | Propagates trial into run creation so repeated runs are distinguishable in the DB. |
| nelson/html_report.py | Pins DeepSeek display names and flags the legacy alias as mixed-tier. |
| nelson/db.py | Bumps schema to v6, adds runs.trial, and indexes runs by (case, competitor, file, trial). |
| nelson/cli.py | Adds bench noise command and adds --repeat option to bench loop. |
| nelson/automate.py | Plans/executes (competitor, case, file, trial) cells and resumes per-trial. |
| competitors.example.yaml | Splits DeepSeek into pinned deepseek-pro and deepseek-flash competitors; documents alias risk. |
| .gitignore | Ignores experimental DB/report artifacts (e.g., nelson-*.db, bench-exp*.html). |

Comment thread nelson/db.py
Comment on lines 320 to 324

     if (
    -    version == 5
    -    and "duplicate column name: target_file" in str(exc).lower()
    +    version in (5, 6)
    +    and "duplicate column name" in str(exc).lower()
     ):
         continue

Fixed in the latest commit. When migration 6's executescript fails with "duplicate column name", the handler now re-runs each statement individually — skipping only the ALTER TABLE duplicate — so that the CREATE INDEX IF NOT EXISTS idx_runs_trial statement still gets applied.
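
The statement-by-statement retry described above might look roughly like the sketch below, using the standard sqlite3 module. It is not the handler in nelson/db.py: the helper name `apply_statements`, the `MIGRATION_6` list, and the indexed columns `case_id` and `competitor` are assumptions. Only `runs.trial`, `target_file`, and `idx_runs_trial` come from this PR.

```python
import sqlite3

# Hypothetical statements standing in for migration 6.
MIGRATION_6 = [
    "ALTER TABLE runs ADD COLUMN trial INTEGER NOT NULL DEFAULT 0",
    "CREATE INDEX IF NOT EXISTS idx_runs_trial "
    "ON runs (case_id, competitor, target_file, trial)",
]


def apply_statements(conn, statements):
    """Run each statement on its own, skipping only a duplicate ADD COLUMN."""
    for stmt in statements:
        try:
            conn.execute(stmt)
        except sqlite3.OperationalError as exc:
            is_alter = stmt.lstrip().upper().startswith("ALTER TABLE")
            if is_alter and "duplicate column name" in str(exc).lower():
                continue  # column already present; keep going to the index
            raise
    conn.commit()
```

Running the statements individually means a database that already has the `trial` column but lacks the index still ends up with `idx_runs_trial`, whereas a single script would stop at the failing ALTER TABLE.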

Comment thread nelson/cli.py
Joe Cooper and others added 2 commits May 31, 2026 20:56
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@swelljoe swelljoe merged commit b435edf into main Jun 1, 2026
1 check passed

3 participants