Add gpt-5.5-pro competitor + partial-coverage asterisk in leaderboard by swelljoe · Pull Request #27 · swelljoe/nelson

swelljoe · 2026-05-31T07:54:12Z

Add openai/gpt-5.5-pro ($30/$180) as an OpenRouter raw-api-loop competitor. A cost-capped probe (4 of 9 cases, $91.30) found it has a spiky edge: it was the only competitor to crack GHSA-9f49 (CWE-416 use-after-free), which 0/16 other competitors found, but it missed a case that a cheaper model got. At ~9-20x the cost per case, it is not a default include.

Because gpt-5.5-pro ran a partial denominator, its detection rate (2/4 = 50%) is not rank-comparable with full-corpus competitors. Flag any competitor that completed fewer cases than the fullest run with a footnoted asterisk in the HTML leaderboard (on both the name and the Detect cell), so a partial-coverage rate is not misread as a rank. The rule is principled, with no hardcoded names: the two qwen 8/9 runs are flagged too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new OpenRouter competitor entry for openai/gpt-5.5-pro and updates the HTML leaderboard to visually flag competitors whose results are based on fewer audited cases than the fullest run, to prevent partial-coverage detection rates being misread as rank-comparable.

Changes:

- Add raw-api-loop/gpt-5.5-pro to competitors.example.yaml with OpenRouter pricing metadata.
- Compute the maximum audited case count across leaderboard entries and append a footnoted asterisk to the Competitor name and Detect cells for partial-coverage rows.
- Add an explanatory footnote to the leaderboard when partial-coverage entries are present.
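The partial-coverage rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in nelson/html_report.py: the `Entry` dataclass, the `render_leaderboard` function, the HTML layout, and the footnote wording are all assumptions made for the example. Only the rule itself (compare each entry's case count with the maximum, mark the name and Detect cells, add a footnote when any row is marked) comes from the PR.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Entry:
    """Hypothetical leaderboard row; field names are illustrative."""
    name: str       # competitor label
    detected: int   # cases in which the vulnerability was found
    cases_run: int  # cases this competitor actually completed


def render_leaderboard(entries: list[Entry]) -> str:
    # The fullest run defines the reference denominator, so no
    # competitor names need to be hardcoded.
    fullest = max((e.cases_run for e in entries), default=0)
    rows = []
    any_partial = False
    for e in entries:
        partial = e.cases_run < fullest
        any_partial = any_partial or partial
        mark = "*" if partial else ""
        pct = round(100 * e.detected / e.cases_run) if e.cases_run else 0
        # Mark both the name cell and the Detect cell.
        rows.append(
            f"<tr><td>{escape(e.name)}{mark}</td>"
            f"<td>{e.detected}/{e.cases_run} ({pct}%){mark}</td></tr>"
        )
    html = "<table>\n" + "\n".join(rows) + "\n</table>"
    if any_partial:
        # The footnote appears only when at least one row is flagged.
        html += (
            f"\n<p>* Ran fewer than {fullest} cases; "
            "detection rate is not rank-comparable.</p>"
        )
    return html
```

With a full 9-case entry alongside the 2/4 gpt-5.5-pro probe, only the gpt-5.5-pro row is marked and the footnote is emitted; if every entry ran the same number of cases, the output contains no asterisk at all.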

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| nelson/html_report.py | Adds partial-coverage detection-rate asterisk markers and a footnote to the leaderboard HTML output. |
| competitors.example.yaml | Adds `raw-api-loop/gpt-5.5-pro` competitor configuration (model id, auth profile, pricing). |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

swelljoe requested a review from Copilot May 31, 2026 07:55

Copilot started reviewing on behalf of swelljoe May 31, 2026 07:55

Copilot AI reviewed May 31, 2026

Comment thread nelson/html_report.py

Comment thread nelson/html_report.py

swelljoe and others added 2 commits May 31, 2026 03:59

Potential fix for pull request finding

06ea1a2

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

112fd54

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

swelljoe merged commit 5a56001 into main May 31, 2026
1 check passed

swelljoe merged 3 commits into main from bench-gpt5.5-pro
