Bench repeat trials #29
Single-run detection is noisy: the same hosted model can score 4/9 one pass and 2/9 the next (empty findings on marginal cases), so n=1 per cell cannot rank cheap models. Add repeated independent trials per cell plus a per-trial variance report.

- schema v6: runs.trial (0-based repeat index), per-(case, competitor, file, trial) index; non-destructive ADD COLUMN, legacy runs default to trial 0.
- plan_matrix(repeat=N): plans N trials per cell, resumes per-trial so a partial pass fills exactly the missing trials (and bumping N later only adds the new trials).
- thread trial through Runner.run_case / create_run / RunScore.
- score.noise_report(): rolls file-runs up to a (competitor, case, trial) outcome, then reports each trial's detection rate, spread (max-min), and per-case hit-frequency, flagging flaky cases (hit some trials, missed others). Single-trial data degrades to spread 0.
- `bench loop --repeat N` and a read-only `bench noise` report command.
- .gitignore: experimental datasets (nelson-*.db) + bench-exp*.html.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
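The rollup described for score.noise_report() can be sketched roughly as follows. This is a minimal illustration, not the code in nelson/score.py: the `FileRun` and `NoiseSummary` shapes and the name `noise_sketch` are hypothetical, and it assumes a (competitor, case, trial) cell counts as a hit when any of its file-runs detected the issue.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FileRun:
    # Hypothetical row shape; the real RunScore fields may differ.
    competitor: str
    case: str
    file: str
    trial: int
    detected: bool


@dataclass
class NoiseSummary:
    trial_rates: dict          # trial -> detection rate over cases
    spread: float              # max rate minus min rate across trials
    hit_frequency: dict        # case -> fraction of trials that hit
    flaky_cases: list = field(default_factory=list)


def noise_sketch(runs):
    # Roll file-runs up to one outcome per (competitor, case, trial).
    # Assumption: the cell is a hit if any file-run in it detected.
    cells = defaultdict(bool)
    for r in runs:
        key = (r.competitor, r.case, r.trial)
        cells[key] = cells[key] or r.detected

    # competitor -> trial -> case -> hit
    by_comp = defaultdict(lambda: defaultdict(dict))
    for (comp, case, trial), hit in cells.items():
        by_comp[comp][trial][case] = hit

    out = {}
    for comp, trials in by_comp.items():
        rates = {t: sum(c.values()) / len(c) for t, c in sorted(trials.items())}
        spread = max(rates.values()) - min(rates.values())
        cases = sorted({case for c in trials.values() for case in c})
        freq = {}
        for case in cases:
            seen = [c[case] for c in trials.values() if case in c]
            freq[case] = sum(seen) / len(seen)
        # Flaky: hit in some trials and missed in others.
        flaky = [case for case, f in freq.items() if 0.0 < f < 1.0]
        out[comp] = NoiseSummary(rates, spread, freq, flaky)
    return out
```

With a single trial there is only one rate, so max and min coincide and the spread is 0, which matches the degradation described above.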
…waps models)

`deepseek-chat` is a server-side alias DeepSeek routes to deepseek-v4-pro OR deepseek-v4-flash at its own discretion: observed pro on 2026-05-30 and flash on 2026-05-31 with Nelson the only key user. That silently swaps the model under a fixed label and confounds results (part of the deepseek detection drop between datasets is plausibly a pro->flash swap, not sampling noise).

/models exposes exactly two real ids: deepseek-v4-pro, deepseek-v4-flash. Pin both as separate competitors (raw-api-loop/deepseek-pro, -flash) so the tier is explicit and the two can be compared directly. Pricing left VERIFY-AT-WIRING.

Leaderboard labels: deepseek-pro -> deepseek-v4-pro, deepseek-flash -> deepseek-v4-flash; the legacy unpinned `raw-api-loop/deepseek` rows are relabelled "deepseek-v4 (alias)" to flag they are a tier mix, not clean pro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Real DeepSeek flash rates in USD per Mtok (verified 2026-05-31 portal): input is 0.14 on a cache miss but 0.0028 on a cache hit (50x cheaper), output 0.28. The cost_model input rate is the miss/list rate; input_cache_hit_usd_per_mtok is recorded but not yet consumed (compute_cost is cache-blind). Because DeepSeek caches aggressively and the ReAct loop resends the whole context each turn, recorded DeepSeek cost overcounts the real bill heavily until cache-aware costing lands. Pro pricing still VERIFY-AT-WIRING.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
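To get a feel for how far a cache-blind cost can drift from a cache-aware one at the flash rates quoted above, here is a small arithmetic sketch. The function names and the 90% cache-hit split in the usage example are hypothetical; this is not the project's compute_cost, which the commit describes as still cache-blind.

```python
# Flash rates from the commit message, in USD per million tokens.
INPUT_MISS_USD_PER_MTOK = 0.14
INPUT_CACHE_HIT_USD_PER_MTOK = 0.0028
OUTPUT_USD_PER_MTOK = 0.28
PER_MTOK = 1_000_000


def cache_blind_cost_usd(input_tokens, output_tokens):
    # Every input token is billed at the miss/list rate.
    return (
        input_tokens * INPUT_MISS_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / PER_MTOK


def cache_aware_cost_usd(miss_tokens, hit_tokens, output_tokens):
    # Input tokens are split into cache misses and cache hits.
    return (
        miss_tokens * INPUT_MISS_USD_PER_MTOK
        + hit_tokens * INPUT_CACHE_HIT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / PER_MTOK


# Hypothetical run: 1M input tokens, 90% served from cache, 100k output tokens.
blind = cache_blind_cost_usd(1_000_000, 100_000)
aware = cache_aware_cost_usd(100_000, 900_000, 100_000)
```

Under that assumed split the cache-blind figure comes out several times larger than the cache-aware one, which is the direction of overcounting the commit warns about; the real ratio depends on the actual hit rate.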
…0.87 (promo)

Promotional pro rates (verified 2026-05-31 portal; use until the promo ends). 3x flash's output and ~3x its miss-input; the prior 0.14/0.28 placeholder under-priced pro. Matches the existing claude-code/deepseek (0.435/0.87) entry. The input rate is the cache-miss/list rate; input_cache_hit_usd_per_mtok is recorded for cache-aware costing (compute_cost is still cache-blind, so recorded cost overcounts until that lands).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
Adds support for repeated benchmark trials (--repeat) so run-to-run variance can be measured and reported per competitor, and pins DeepSeek competitors to explicit tiers to avoid silent model routing changes.
Changes:
- Add `trial` (0-based repeat index) to persisted runs and propagate it through planning/execution/scoring.
- Introduce a `bench noise` report (and scoring helper) to summarize per-trial detection variance and flaky cases.
- Update DeepSeek competitor naming/mapping to distinguish pinned `pro` vs `flash`, and adjust examples/ignore patterns.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `tests/test_score.py` | Adds unit tests for the new noise_report behavior (per-trial rollup, flaky detection, single-trial spread). |
| `tests/test_automate.py` | Extends planning/resume tests to cover repeat trials and adds trial to existing-run indexing fixtures. |
| `nelson/score.py` | Adds RunScore.trial and implements noise_report/CompetitorNoise for variance reporting. |
| `nelson/runner.py` | Propagates trial into run creation so repeated runs are distinguishable in the DB. |
| `nelson/html_report.py` | Pins DeepSeek display names and flags the legacy alias as mixed-tier. |
| `nelson/db.py` | Bumps schema to v6, adds runs.trial, and indexes runs by (case, competitor, file, trial). |
| `nelson/cli.py` | Adds bench noise command and adds --repeat option to bench loop. |
| `nelson/automate.py` | Plans/executes (competitor, case, file, trial) cells and resumes per-trial. |
| `competitors.example.yaml` | Splits DeepSeek into pinned deepseek-pro and deepseek-flash competitors; documents alias risk. |
| `.gitignore` | Ignores experimental DB/report artifacts (e.g., nelson-*.db, bench-exp*.html). |
  if (
-     version == 5
-     and "duplicate column name: target_file" in str(exc).lower()
+     version in (5, 6)
+     and "duplicate column name" in str(exc).lower()
  ):
      continue
Fixed in the latest commit. When migration 6's executescript fails with "duplicate column name", the handler now re-runs each statement individually — skipping only the ALTER TABLE duplicate — so that the CREATE INDEX IF NOT EXISTS idx_runs_trial statement still gets applied.
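A rough sketch of that statement-by-statement fallback, using the standard library `sqlite3` module, is shown here. Only the index name idx_runs_trial and the "duplicate column name" check come from this thread; the statement text, the column names other than trial and target_file, and the helper name are assumptions for illustration, not the code in nelson/db.py.

```python
import sqlite3

# Hypothetical migration 6 statements; the real SQL may differ.
MIGRATION_6 = [
    "ALTER TABLE runs ADD COLUMN trial INTEGER NOT NULL DEFAULT 0",
    "CREATE INDEX IF NOT EXISTS idx_runs_trial "
    "ON runs (case_id, competitor, target_file, trial)",
]


def apply_statements(conn, statements):
    """Run statements one at a time, skipping only an ALTER TABLE that
    fails because the column already exists."""
    for stmt in statements:
        try:
            conn.execute(stmt)
        except sqlite3.OperationalError as exc:
            is_alter = stmt.lstrip().upper().startswith("ALTER TABLE")
            if is_alter and "duplicate column name" in str(exc).lower():
                continue  # column already present; keep going so the index is created
            raise
    conn.commit()


# Usage: a DB where runs.trial already exists but the index does not.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (case_id TEXT, competitor TEXT, target_file TEXT, "
    "trial INTEGER NOT NULL DEFAULT 0)"
)
apply_statements(conn, MIGRATION_6)
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
```

Running the statements individually matters because a single failing script would otherwise stop before the CREATE INDEX IF NOT EXISTS step is reached.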
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Add `--repeat` feature, so we can do multiple runs with the same prompt, etc. across all models. Since there is definitely some noise in results, replication of results is valuable and provides useful data.

Also fixes the fuzzy specifier for DeepSeek, which used deepseek-v4-pro almost exclusively yesterday, but for some reason only uses `deepseek-v4-flash` today... we want to know which model we're using at all times. They perform pretty differently, as expected.