Add gpt-5.5-pro competitor + partial-coverage asterisk in leaderboard #27

Merged: swelljoe merged 3 commits into main from bench-gpt5.5-pro on May 31, 2026

@swelljoe
Owner

Add openai/gpt-5.5-pro ($30/$180) as an OpenRouter raw-api-loop competitor. A cost-capped probe (4 of 9 cases, $91.30) showed a spiky edge: it was the only competitor to crack GHSA-9f49 (CWE-416 use-after-free), which 0 of the 16 other competitors found, but it missed a case that a cheaper model caught. At ~9-20x the cost per case, it is not a default include.
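
The per-case cost behind that judgment can be worked out from the probe figures. This is a rough sketch only: the variable names are illustrative, and the implied per-case range for cheaper competitors assumes the ~9-20x multiplier is relative to their cost per case, which the description does not spell out.

```python
from decimal import Decimal

# Figures taken from the PR description.
probe_cost_usd = Decimal("91.30")  # total spend of the cost-capped probe
cases_run = 4                      # cases completed out of the 9-case corpus

# Average spend per completed case.
cost_per_case = probe_cost_usd / cases_run

# ASSUMPTION: "~9-20x the cost-per-case" compares against other competitors'
# per-case cost, so dividing by the multipliers gives their implied range.
implied_cheaper_low = cost_per_case / 20
implied_cheaper_high = cost_per_case / 9

print(f"gpt-5.5-pro cost per case: ${cost_per_case}")
print(f"implied cheaper-model range: ${implied_cheaper_low:.2f} to ${implied_cheaper_high:.2f}")
```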

Because gpt-5.5-pro ran against a partial denominator, its detection rate (2/4 = 50%) is not rank-comparable with full-corpus competitors. Flag any competitor that completed fewer cases than the fullest run with a footnoted asterisk in the HTML leaderboard (on both the name and the Detect cell), so a partial-coverage rate is not misread as a rank. The rule is principled, with no hardcoded names: the two qwen 8/9 runs are flagged too.
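
A minimal sketch of the flagging rule described above, assuming each leaderboard row carries a count of completed cases. The `Row` type, its field names, and the competitor names other than gpt-5.5-pro are hypothetical and are not taken from nelson/html_report.py; the detection counts for the non-pro rows are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """Hypothetical leaderboard row; field names are illustrative."""
    name: str
    detected: int
    cases_run: int


def flag_partial(rows):
    """Map each competitor name to True if it ran fewer cases than the fullest run."""
    fullest = max((r.cases_run for r in rows), default=0)
    return {r.name: r.cases_run < fullest for r in rows}


rows = [
    Row("full-corpus-model", 5, 9),  # placeholder detection count
    Row("gpt-5.5-pro", 2, 4),        # 2/4 from the PR description
    Row("qwen-run", 3, 8),           # 8/9 coverage; detection count is a placeholder
]
flags = flag_partial(rows)
for row in rows:
    mark = "*" if flags[row.name] else ""
    print(f"{row.name}{mark}: {row.detected}/{row.cases_run}{mark}")
```

Because the threshold is derived from the data rather than from a list of names, any future partial run is flagged without further changes.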

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Adds a new OpenRouter competitor entry for openai/gpt-5.5-pro and updates the HTML leaderboard to visually flag competitors whose results are based on fewer audited cases than the fullest run, so that partial-coverage detection rates are not misread as rank-comparable.

Changes:

  • Add raw-api-loop/gpt-5.5-pro to competitors.example.yaml with OpenRouter pricing metadata.
  • Compute the maximum audited case count across leaderboard entries and append a footnoted asterisk to the Competitor name and Detect cells for partial-coverage rows.
  • Add an explanatory footnote to the leaderboard when partial-coverage entries are present.
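
The three changes listed above could be rendered roughly as follows. This is an illustrative sketch, not the code in nelson/html_report.py: the function name, the tuple layout, the footnote wording, and the markup are all assumptions.

```python
from html import escape

# Hypothetical footnote wording; the real text in the PR may differ.
FOOTNOTE = "* Partial coverage: rate is based on fewer cases than the fullest run."


def render_leaderboard(entries):
    """Render (name, detected, audited) tuples as a small HTML table."""
    max_audited = max((audited for _, _, audited in entries), default=0)
    any_partial = False
    lines = ["<table>", "<tr><th>Competitor</th><th>Detect</th></tr>"]
    for name, detected, audited in entries:
        partial = audited < max_audited
        any_partial = any_partial or partial
        mark = "*" if partial else ""
        # The asterisk goes on both the Competitor name and the Detect cell.
        lines.append(
            f"<tr><td>{escape(name)}{mark}</td><td>{detected}/{audited}{mark}</td></tr>"
        )
    lines.append("</table>")
    # Only emit the footnote when at least one row is flagged.
    if any_partial:
        lines.append(f"<p>{escape(FOOTNOTE)}</p>")
    return "\n".join(lines)


page = render_leaderboard([
    ("full-corpus-model", 5, 9),         # hypothetical competitor and counts
    ("raw-api-loop/gpt-5.5-pro", 2, 4),  # 2/4 from the PR description
])
print(page)
```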

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| nelson/html_report.py | Adds partial-coverage detection-rate asterisk markers and a footnote to the leaderboard HTML output. |
| competitors.example.yaml | Adds raw-api-loop/gpt-5.5-pro competitor configuration (model id, auth profile, pricing). |

Comment thread nelson/html_report.py
Comment thread nelson/html_report.py
swelljoe and others added 2 commits May 31, 2026 03:59
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@swelljoe swelljoe merged commit 5a56001 into main May 31, 2026
1 check passed