Bench report labels #28
Merged
The runtime prefix on competitor labels carries no information in the report: every non-Claude model runs in the API loop, so `raw-api-loop/` is just noise, and the Claude models are identifiable by name. Strip the leading `runtime/` prefix from every leaderboard label.

A few competitors are registered under a bare model nickname; map those to their versioned product name so every row reads at the same level of detail: deepseek-v4-pro, mimo-v2.5-pro, haiku-4.5, sonnet-4.6, opus-4.8.

Display-only via a new `_display_name` helper applied at the four render sites (leaderboard table, per-case matrix rows, both scatter plots, and the token-bar chart). The stored `competitor_name` is untouched, so every DB / frontier-membership / matrix-cell lookup keyed on it is unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mimo (raw-api-loop + retired claude-code) is a 1T+ parameter model → large, matching deepseek. gemini-3.1-pro-preview → large; the cheaper gemini-3.5-flash → medium. Brings every active competitor's Size column in the leaderboard to a concrete value (no more "—").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
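As a rough illustration, the size assignments above could be held in a plain lookup table with the placeholder as the fallback. The table name, the `size_for` function, and keying by the bare model name (the last `/`-separated segment) are assumptions, not the repository's actual layout.

```python
# Size classes named in the commit message. Keying by bare model name is an
# assumption made for this sketch.
_MODEL_SIZE = {
    "deepseek": "large",
    "mimo": "large",
    "gemini-3.1-pro-preview": "large",
    "gemini-3.5-flash": "medium",
}

# Placeholder the leaderboard shows when no size is registered.
_UNKNOWN_SIZE = "—"


def size_for(competitor_name: str) -> str:
    """Look up a size class by the model part of a competitor name."""
    # Take the segment after the last "/" so that the same model registered
    # under different runtimes resolves to the same size.
    model = competitor_name.rsplit("/", 1)[-1]
    return _MODEL_SIZE.get(model, _UNKNOWN_SIZE)
```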
gpt-5.5-pro is a 4/9 cost-capped probe at ~$22.83/case. On the Pareto scatters it wrecked both axes: the cost axis stretched to $22 so every full-corpus competitor (all <=$1.45) collapsed into one indistinguishable cluster, and as the highest-quality point it was drawn on the frontier, implying a "best" ranking its 4-case probe never established. The leaderboard asterisk already handles this in the table, but the charts had no equivalent guard.

Filter the scatter inputs (and the frontier computation that feeds the table ★) to competitors that audited at least _FRONTIER_MIN_COVERAGE (75%) of the fullest run's cases. Generic threshold, no hardcoded name: the near-complete qwen 8/9 runs stay (one is the free cost-frontier point worth keeping), while the 4/9 probe drops. Excluded competitors keep their asterisked leaderboard row, and a note under the charts names who was omitted and why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
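The coverage gate described above can be sketched as follows. Only `_FRONTIER_MIN_COVERAGE` and its 75% value come from the commit message; the `Result` record, its field names, and both function names are hypothetical, and the frontier here is a straightforward quadratic dominance check rather than whatever the report actually uses.

```python
from dataclasses import dataclass

_FRONTIER_MIN_COVERAGE = 0.75  # share of the fullest run's cases, per the commit


@dataclass(frozen=True)
class Result:  # hypothetical record type for this sketch
    name: str
    cases_audited: int
    cost_per_case: float
    quality: float


def split_by_coverage(results):
    """Split results into (kept, omitted) relative to the fullest run."""
    fullest = max((r.cases_audited for r in results), default=0)
    threshold = _FRONTIER_MIN_COVERAGE * fullest
    kept, omitted = [], []
    for r in results:
        (kept if r.cases_audited >= threshold else omitted).append(r)
    return kept, omitted


def pareto_frontier(results):
    """Keep results not dominated on (lower cost, higher quality)."""
    def dominates(a, b):
        no_worse = a.cost_per_case <= b.cost_per_case and a.quality >= b.quality
        better = a.cost_per_case < b.cost_per_case or a.quality > b.quality
        return no_worse and better

    return [r for r in results if not any(dominates(o, r) for o in results)]
```

With a fullest run of 9 cases the threshold is 6.75 cases, so an 8/9 run passes and a 4/9 probe does not. Computing the frontier only over `kept` is what prevents a partial probe from earning the ★, and `omitted` supplies the names for the note under the charts.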
The leaderboard carried two footnote-style marks next to a competitor's name (★ for "on a Pareto frontier below" and * for "partial coverage"), which read as the same kind of caveat and were easy to conflate (a small green star can look asterisk-ish).

Move the partial-coverage mark to a dagger (†), the conventional secondary-footnote symbol, so the two are unmistakable: ★ = frontier (a positive marker), † = partial coverage (a caveat). Footnote, legend, and the Pareto-exclusion note updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
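One way to keep the two marks from drifting apart again is to define each symbol once and build the label from flags. This is a sketch only: the constant names, the `decorate_label` function, and its parameters are hypothetical.

```python
FRONTIER_MARK = "★"  # positive marker: on a Pareto frontier
PARTIAL_MARK = "†"   # caveat: competitor did not audit every case


def decorate_label(label: str, on_frontier: bool,
                   cases_audited: int, total_cases: int) -> str:
    """Append the frontier and/or partial-coverage mark to a display label."""
    marks = []
    if on_frontier:
        marks.append(FRONTIER_MARK)
    if cases_audited < total_cases:
        marks.append(PARTIAL_MARK)
    return " ".join([label, *marks])
```

Referencing the same constants from the footnote, the legend, and the Pareto-exclusion note would keep all three in step with the table.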
Make the report nicer.