
Bench report labels#28

Merged
swelljoe merged 4 commits into main from bench-report-labels on May 31, 2026

Conversation

@swelljoe
Owner

Make the report nicer.

Joe Cooper and others added 4 commits May 31, 2026 03:16
The runtime prefix on competitor labels carries no information in the
report — every non-Claude model runs in the API loop, so `raw-api-loop/`
is just noise, and the Claude models are identifiable by name. Strip the
leading `runtime/` prefix from every leaderboard label.

A few competitors are registered under a bare model nickname; map those
to their versioned product name so every row reads at the same level of
detail: deepseek-v4-pro, mimo-v2.5-pro, haiku-4.5, sonnet-4.6, opus-4.8.

Display-only via a new `_display_name` helper applied at the four render
sites (leaderboard table, per-case matrix rows, both scatter plots, and
the token-bar chart). The stored competitor_name is untouched, so every
DB / frontier-membership / matrix-cell lookup keyed on it is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mimo (raw-api-loop + retired claude-code) is a 1T+ parameter model →
large, matching deepseek. gemini-3.1-pro-preview → large; the cheaper
gemini-3.5-flash → medium. Brings every active competitor's Size column
in the leaderboard to a concrete value (no more "—").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
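
The size assignments in the commit above amount to a small lookup table. A minimal sketch, assuming the table is keyed by model name (the actual key format and variable names in the repository are not shown in this PR, so `_SIZE_CLASS` and `size_label` are hypothetical):

```python
# Size classes taken from the commit message; the dict name, the key
# format, and the helper below are assumptions for illustration.
_SIZE_CLASS = {
    "mimo-v2.5-pro": "large",           # 1T+ parameters
    "deepseek-v4-pro": "large",
    "gemini-3.1-pro-preview": "large",
    "gemini-3.5-flash": "medium",
}


def size_label(model: str) -> str:
    """Return the Size column value, or the old "—" placeholder if unknown."""
    return _SIZE_CLASS.get(model, "—")
```
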
gpt-5.5-pro is a 4/9 cost-capped probe at ~$22.83/case. On the Pareto
scatters it wrecked both axes: the cost axis stretched to $22 so every
full-corpus competitor (all <=$1.45) collapsed into one indistinguishable
cluster, and as the highest-quality point it was drawn on the frontier —
implying a "best" ranking its 4-case probe never established. The
leaderboard asterisk already handles this in the table, but the charts
had no equivalent guard.

Filter the scatter inputs (and the frontier computation that feeds the
table ★) to competitors that audited at least _FRONTIER_MIN_COVERAGE
(75%) of the fullest run's cases. Generic threshold, no hardcoded name:
the near-complete qwen 8/9 runs stay (one is the free cost-frontier
point worth keeping), while the 4/9 probe drops. Excluded competitors
keep their asterisked leaderboard row, and a note under the charts names
who was omitted and why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
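
The coverage guard in the commit above can be checked by hand: with a fullest run of 9 cases, 8/9 ≈ 0.89 clears the 0.75 threshold while 4/9 ≈ 0.44 does not. The sketch below shows one way the filter and a cost-versus-quality frontier could fit together. Only `_FRONTIER_MIN_COVERAGE` is named in the commit; the `Result` class, its fields, and both function names are hypothetical, and the dominance rule (lower cost, higher quality) is an assumption about how the frontier is defined.

```python
from dataclasses import dataclass

_FRONTIER_MIN_COVERAGE = 0.75  # fraction of the fullest run's cases


@dataclass(frozen=True)
class Result:
    # Hypothetical record; field names are illustrative only.
    name: str
    cases_audited: int
    cost_per_case: float  # lower is better
    quality: float        # higher is better


def eligible(results):
    """Split results into (kept, omitted) by coverage of the fullest run."""
    fullest = max((r.cases_audited for r in results), default=0)
    kept, omitted = [], []
    for r in results:
        ok = fullest > 0 and r.cases_audited / fullest >= _FRONTIER_MIN_COVERAGE
        (kept if ok else omitted).append(r)
    return kept, omitted


def pareto_frontier(results):
    """Return the results that no other result dominates."""
    def dominated(r):
        # o dominates r if it is no worse on both axes and better on one.
        return any(
            o.cost_per_case <= r.cost_per_case
            and o.quality >= r.quality
            and (o.cost_per_case < r.cost_per_case or o.quality > r.quality)
            for o in results
        )
    return [r for r in results if not dominated(r)]
```

Running `pareto_frontier` on the `kept` list rather than on all results is what keeps a low-coverage probe off the frontier, while the `omitted` list can feed the note under the charts.
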
The leaderboard carried two footnote-style marks next to a competitor's
name — ★ for "on a Pareto frontier below" and * for "partial coverage" —
which read as the same kind of caveat and were easy to conflate (a small
green star can look asterisk-ish). Move the partial-coverage mark to a
dagger (†), the conventional secondary-footnote symbol, so the two are
unmistakable: ★ = frontier (a positive marker), † = partial coverage (a
caveat). Footnote, legend, and the Pareto-exclusion note updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
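
The two marks from the commit above can be kept apart by defining each once and composing them in a single place. A minimal sketch, assuming a hypothetical `decorate` helper (the PR does not show how the report code attaches the marks):

```python
FRONTIER_MARK = "★"  # positive marker: on a Pareto frontier
PARTIAL_MARK = "†"   # caveat: partial coverage (previously "*")


def decorate(label: str, on_frontier: bool, partial: bool) -> str:
    """Append the frontier and/or partial-coverage mark to a label."""
    marks = (FRONTIER_MARK if on_frontier else "") + (
        PARTIAL_MARK if partial else ""
    )
    return f"{label} {marks}" if marks else label
```

A competitor can carry both marks, for example a near-complete run that still sits on the frontier.
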
@swelljoe merged commit 9f1b640 into main on May 31, 2026
1 check passed