Bench repeat trials by swelljoe · Pull Request #29 · swelljoe/nelson

swelljoe · 2026-06-01T01:42:24Z

Add --repeat feature, so we can do multiple runs with the same prompt, etc. across all models. Since there is definitely some noise in results, replication of results is valuable and provides useful data.

Also fixes the fuzzy specifier for DeepSeek, which used deepseek-v4-pro almost exclusively yesterday but, for some reason, only uses deepseek-v4-flash today. We want to know which model we're using at all times. They perform pretty differently, as expected.

Single-run detection is noisy: the same hosted model can score 4/9 one pass and 2/9 the next (empty findings on marginal cases), so n=1 per cell cannot rank cheap models. Add repeated independent trials per cell plus a per-trial variance report.

- schema v6: runs.trial (0-based repeat index), per-(case, competitor, file, trial) index; non-destructive ADD COLUMN, legacy runs default to trial 0.
- plan_matrix(repeat=N): plans N trials per cell, resumes per-trial so a partial pass fills exactly the missing trials (and bumping N later only adds the new trials).
- thread trial through Runner.run_case / create_run / RunScore.
- score.noise_report(): rolls file-runs up to a (competitor, case, trial) outcome, then reports each trial's detection rate, spread (max-min), and per-case hit-frequency, flagging flaky cases (hit some trials, missed others). Single-trial data degrades to spread 0.
- `bench loop --repeat N` and a read-only `bench noise` report command.
- .gitignore: experimental datasets (nelson-*.db) + bench-exp*.html.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
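The per-trial resume behaviour described for plan_matrix(repeat=N) can be illustrated with a minimal sketch. This is not the project's implementation: the function name `plan_missing_trials`, the tuple layout, and the sample competitor/case names are assumptions for illustration, and the real planner also keys cells by file.

```python
from itertools import product


def plan_missing_trials(competitors, cases, existing, repeat=1):
    """Return the (competitor, case, trial) cells that still need a run.

    `existing` is a set of (competitor, case, trial) tuples already recorded.
    Hypothetical helper: the real plan_matrix may differ in shape and naming.
    """
    planned = []
    for competitor, case in product(competitors, cases):
        for trial in range(repeat):  # trial is a 0-based repeat index
            cell = (competitor, case, trial)
            if cell not in existing:
                planned.append(cell)
    return planned


# Trials 0 and 1 already exist; bumping repeat to 3 plans only trial 2.
existing = {("deepseek-pro", "case-a", 0), ("deepseek-pro", "case-a", 1)}
todo = plan_missing_trials(["deepseek-pro"], ["case-a"], existing, repeat=3)
print(todo)
```

Because the plan is computed as "wanted cells minus recorded cells", a partial pass and a later increase of N are handled by the same code path.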

…waps models)

`deepseek-chat` is a server-side alias that DeepSeek routes to deepseek-v4-pro or deepseek-v4-flash at its own discretion: we observed pro on 2026-05-30 and flash on 2026-05-31, with Nelson the only key user. That silently swaps the model under a fixed label and confounds results (part of the DeepSeek detection drop between datasets is plausibly a pro->flash swap, not sampling noise).

/models exposes exactly two real ids: deepseek-v4-pro and deepseek-v4-flash. Pin both as separate competitors (raw-api-loop/deepseek-pro, -flash) so the tier is explicit and the two can be compared directly. Pricing left VERIFY-AT-WIRING.

Leaderboard labels: deepseek-pro -> deepseek-v4-pro, deepseek-flash -> deepseek-v4-flash; the legacy unpinned `raw-api-loop/deepseek` rows are relabelled "deepseek-v4 (alias)" to flag that they are a tier mix, not clean pro.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Real DeepSeek flash rates (verified 2026-05-31 portal): input is 0.14 on a cache miss but 0.0028 on a cache hit (50x cheaper), output 0.28. The cost_model input rate is the miss/list rate; input_cache_hit_usd_per_mtok is recorded but not yet consumed (compute_cost is cache-blind). Because DeepSeek caches aggressively and the ReAct loop resends the whole context each turn, recorded DeepSeek cost overcounts the real bill heavily until cache-aware costing lands. Pro pricing still VERIFY-AT-WIRING.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
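To make the overcount concrete, here is a rough sketch of what cache-aware input costing could look like with the flash rates quoted above. The function `input_cost_usd` and the token counts are hypothetical, chosen only for illustration; this is not the project's compute_cost, and the rates are assumed to be USD per million tokens, as the `input_cache_hit_usd_per_mtok` field name suggests.

```python
# Flash input rates from the commit message above, assumed USD per million tokens.
FLASH_MISS_USD_PER_MTOK = 0.14
FLASH_HIT_USD_PER_MTOK = 0.0028


def input_cost_usd(miss_tokens, hit_tokens,
                   miss_rate=FLASH_MISS_USD_PER_MTOK,
                   hit_rate=FLASH_HIT_USD_PER_MTOK):
    """Price cache-miss and cache-hit input tokens at their own rates."""
    return (miss_tokens * miss_rate + hit_tokens * hit_rate) / 1_000_000


# Hypothetical run: 1M input tokens, 900k of them served from cache.
total_input = 1_000_000
hit_tokens = 900_000

# Cache-blind costing treats every input token as a miss.
cache_blind = input_cost_usd(total_input, 0)
cache_aware = input_cost_usd(total_input - hit_tokens, hit_tokens)
print(f"cache-blind: ${cache_blind:.5f}  cache-aware: ${cache_aware:.5f}")
```

With these made-up token counts the cache-blind figure is about 0.14 USD against roughly 0.0165 USD cache-aware, which is the kind of gap the message warns about when a loop resends its whole context each turn.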

…0.87 (promo)

Promotional pro rates (verified 2026-05-31 portal; use until the promo ends): 3x flash's output and ~3x its miss-input. The prior 0.14/0.28 placeholder under-priced pro. Matches the existing claude-code/deepseek (0.435/0.87) entry. The input rate is the cache-miss/list rate; input_cache_hit_usd_per_mtok is recorded for cache-aware costing (compute_cost is still cache-blind, so recorded cost overcounts until that lands).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds support for repeated benchmark trials (--repeat) so run-to-run variance can be measured and reported per competitor, and pins DeepSeek competitors to explicit tiers to avoid silent model routing changes.

Changes:

Add trial (0-based repeat index) to persisted runs and propagate it through planning/execution/scoring.
Introduce a bench noise report (and scoring helper) to summarize per-trial detection variance and flaky cases.
Update DeepSeek competitor naming/mapping to distinguish pinned pro vs flash, and adjust examples/ignore patterns.
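As a rough illustration of the per-trial variance summary, the sketch below computes each trial's detection rate, the spread (max minus min), per-case hit frequency, and the flaky cases for one competitor. The function name `noise_summary`, the input shape, and the returned dictionary are assumptions for this example, not the actual `noise_report`/`CompetitorNoise` API.

```python
from collections import defaultdict


def noise_summary(outcomes):
    """Summarize (case, trial, hit) outcomes for a single competitor.

    Hypothetical helper: the input shape and result keys are illustrative only.
    """
    by_trial = defaultdict(list)
    by_case = defaultdict(list)
    for case, trial, hit in outcomes:
        by_trial[trial].append(bool(hit))
        by_case[case].append(bool(hit))

    # Detection rate per trial: fraction of cases hit in that trial.
    rates = {t: sum(hits) / len(hits) for t, hits in sorted(by_trial.items())}
    # Spread is max-min across trials; a single trial gives 0.
    spread = max(rates.values()) - min(rates.values()) if rates else 0.0
    # How often each case was hit across trials.
    hit_frequency = {c: sum(hits) / len(hits) for c, hits in by_case.items()}
    # Flaky: hit in some trials, missed in others.
    flaky = sorted(c for c, hits in by_case.items() if any(hits) and not all(hits))
    return {"rates": rates, "spread": spread,
            "hit_frequency": hit_frequency, "flaky": flaky}


outcomes = [
    ("case-a", 0, True), ("case-b", 0, True),
    ("case-a", 1, True), ("case-b", 1, False),
]
summary = noise_summary(outcomes)
print(summary["rates"], summary["spread"], summary["flaky"])
```

In this made-up data, case-b is hit in trial 0 and missed in trial 1, so it is flagged flaky and the spread is 0.5.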

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
`tests/test_score.py`	Adds unit tests for the new `noise_report` behavior (per-trial rollup, flaky detection, single-trial spread).
`tests/test_automate.py`	Extends planning/resume tests to cover `repeat` trials and adds `trial` to existing-run indexing fixtures.
`nelson/score.py`	Adds `RunScore.trial` and implements `noise_report`/`CompetitorNoise` for variance reporting.
`nelson/runner.py`	Propagates `trial` into run creation so repeated runs are distinguishable in the DB.
`nelson/html_report.py`	Pins DeepSeek display names and flags the legacy alias as mixed-tier.
`nelson/db.py`	Bumps schema to v6, adds `runs.trial`, and indexes runs by `(case, competitor, file, trial)`.
`nelson/cli.py`	Adds `bench noise` command and adds `--repeat` option to `bench loop`.
`nelson/automate.py`	Plans/executes `(competitor, case, file, trial)` cells and resumes per-trial.
`competitors.example.yaml`	Splits DeepSeek into pinned `deepseek-pro` and `deepseek-flash` competitors; documents alias risk.
`.gitignore`	Ignores experimental DB/report artifacts (e.g., `nelson-*.db`, `bench-exp*.html`).


Copilot · 2026-06-01T02:01:52Z

                    if (
-                        version == 5
-                        and "duplicate column name: target_file" in str(exc).lower()
+                        version in (5, 6)
+                        and "duplicate column name" in str(exc).lower()
                    ):
                        continue


Fixed in the latest commit. When migration 6's executescript fails with "duplicate column name", the handler now re-runs each statement individually — skipping only the ALTER TABLE duplicate — so that the CREATE INDEX IF NOT EXISTS idx_runs_trial statement still gets applied.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
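The statement-by-statement fallback described in the reply can be sketched with the standard `sqlite3` module. The migration text, the `runs` columns, and the helper name `apply_statements` are assumptions made for a self-contained example; only the `idx_runs_trial` index name and the "duplicate column name" check come from the discussion above.

```python
import sqlite3

# Hypothetical stand-in for migration 6; the real column list may differ.
MIGRATION_6 = [
    "ALTER TABLE runs ADD COLUMN trial INTEGER NOT NULL DEFAULT 0",
    "CREATE INDEX IF NOT EXISTS idx_runs_trial "
    "ON runs (case_id, competitor, target_file, trial)",
]


def apply_statements(conn, statements):
    """Run each statement on its own, skipping only duplicate-column errors."""
    for stmt in statements:
        try:
            conn.execute(stmt)
        except sqlite3.OperationalError as exc:
            if "duplicate column name" in str(exc).lower():
                continue  # column already present; later statements still run
            raise
    conn.commit()


# Simulate a database where the column already exists but the index does not.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (case_id TEXT, competitor TEXT, target_file TEXT, "
    "trial INTEGER NOT NULL DEFAULT 0)"
)
apply_statements(conn, MIGRATION_6)
index_names = [
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
print(index_names)
```

Running the statements individually means one failing ALTER TABLE no longer aborts the rest of the script, which is why the index is still created here.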


Joe Cooper and others added 4 commits May 31, 2026 14:49

swelljoe requested a review from Copilot June 1, 2026 01:42

Copilot started reviewing on behalf of swelljoe June 1, 2026 01:42

Copilot AI reviewed Jun 1, 2026

Joe Cooper and others added 2 commits May 31, 2026 20:56

Format

f5f8b80

Potential fix for pull request finding

b4e0161

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot started work on behalf of swelljoe June 1, 2026 01:58

Fix migration 6 error handler to apply remaining statements after duplicate column

7eab46a

Copilot finished work on behalf of swelljoe June 1, 2026 02:03

Fix bogus syntax

59d62de

swelljoe merged commit b435edf into main Jun 1, 2026
1 check passed
