Bench report labels by swelljoe · Pull Request #28 · swelljoe/nelson

swelljoe · 2026-05-31T08:36:50Z

Make the report nicer.

The runtime prefix on competitor labels carries no information in the report — every non-Claude model runs in the API loop, so `raw-api-loop/` is just noise, and the Claude models are identifiable by name. Strip the leading `runtime/` prefix from every leaderboard label. A few competitors are registered under a bare model nickname; map those to their versioned product name so every row reads at the same level of detail: deepseek-v4-pro, mimo-v2.5-pro, haiku-4.5, sonnet-4.6, opus-4.8. Display-only via a new `_display_name` helper applied at the four render sites (leaderboard table, per-case matrix rows, both scatter plots, and the token-bar chart). The stored competitor_name is untouched, so every DB / frontier-membership / matrix-cell lookup keyed on it is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

mimo (raw-api-loop + retired claude-code) is a 1T+ parameter model → large, matching deepseek. gemini-3.1-pro-preview → large; the cheaper gemini-3.5-flash → medium. Brings every active competitor's Size column in the leaderboard to a concrete value (no more "—").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gpt-5.5-pro is a 4/9 cost-capped probe at ~$22.83/case. On the Pareto scatters it wrecked both axes: the cost axis stretched to $22 so every full-corpus competitor (all <=$1.45) collapsed into one indistinguishable cluster, and as the highest-quality point it was drawn on the frontier — implying a "best" ranking its 4-case probe never established. The leaderboard asterisk already handles this in the table, but the charts had no equivalent guard. Filter the scatter inputs (and the frontier computation that feeds the table ★) to competitors that audited at least _FRONTIER_MIN_COVERAGE (75%) of the fullest run's cases. Generic threshold, no hardcoded name: the near-complete qwen 8/9 runs stay (one is the free cost-frontier point worth keeping), while the 4/9 probe drops. Excluded competitors keep their asterisked leaderboard row, and a note under the charts names who was omitted and why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
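A minimal sketch of the coverage guard and the frontier computation it feeds, under stated assumptions: only `_FRONTIER_MIN_COVERAGE` and its 75% value come from the description. The `Row` type, its field names, and both function names are hypothetical, and the real report may compute its frontiers differently.

```python
from dataclasses import dataclass

_FRONTIER_MIN_COVERAGE = 0.75  # threshold named in the PR description


@dataclass(frozen=True)
class Row:
    """Hypothetical per-competitor summary used only for this sketch."""
    name: str
    cases_audited: int
    cost_per_case: float  # USD, lower is better
    quality: float        # higher is better


def split_by_coverage(rows):
    """Return (eligible, excluded) relative to the fullest run's case count."""
    fullest = max((r.cases_audited for r in rows), default=0)
    eligible, excluded = [], []
    for r in rows:
        coverage = r.cases_audited / fullest if fullest else 0.0
        target = eligible if coverage >= _FRONTIER_MIN_COVERAGE else excluded
        target.append(r)
    return eligible, excluded


def pareto_frontier(rows):
    """Keep rows that no other row beats on both cost and quality."""
    def dominated(r):
        return any(
            o.cost_per_case <= r.cost_per_case
            and o.quality >= r.quality
            and (o.cost_per_case < r.cost_per_case or o.quality > r.quality)
            for o in rows
        )
    return [r for r in rows if not dominated(r)]
```

With a fullest run of 9 cases, an 8/9 run has coverage of about 0.89 and stays, while a 4/9 probe has coverage of about 0.44 and lands in `excluded`. Running `pareto_frontier` only on the `eligible` list keeps the probe off both the charts and the frontier, and the `excluded` list is what a note under the charts could name.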

The leaderboard carried two footnote-style marks next to a competitor's name — ★ for "on a Pareto frontier below" and * for "partial coverage" — which read as the same kind of caveat and were easy to conflate (a small green star can look asterisk-ish). Move the partial-coverage mark to a dagger (†), the conventional secondary-footnote symbol, so the two are unmistakable: ★ = frontier (a positive marker), † = partial coverage (a caveat). Footnote, legend, and the Pareto-exclusion note updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
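As an illustration of the two-symbol scheme, a label renderer might attach the marks as below. The function name, constant names, and spacing are assumptions for this sketch; only the symbols and their meanings are taken from the description.

```python
FRONTIER_MARK = "★"  # positive marker: on a Pareto frontier
PARTIAL_MARK = "†"   # caveat: partial coverage


def marked_label(label: str, on_frontier: bool, partial: bool) -> str:
    """Append the frontier star and/or the partial-coverage dagger."""
    marks = []
    if on_frontier:
        marks.append(FRONTIER_MARK)
    if partial:
        marks.append(PARTIAL_MARK)
    return " ".join([label, *marks])
```

Keeping the two symbols in named constants means the footnote, the legend, and the exclusion note can reference the same values rather than repeating literal characters.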

Joe Cooper and others added 4 commits May 31, 2026 03:16

swelljoe merged commit 9f1b640 into main May 31, 2026
1 check passed

swelljoe merged 4 commits into main from bench-report-labels
