Skip to content

[specdec_bench] Stratify --num_requests across categories#1389

Open
milesial wants to merge 2 commits intoNVIDIA:mainfrom
milesial:specdec-stratify-num-requests
Open

[specdec_bench] Stratify --num_requests across categories#1389
milesial wants to merge 2 commits intoNVIDIA:mainfrom
milesial:specdec-stratify-num-requests

Conversation

@milesial
Copy link
Copy Markdown

@milesial milesial commented May 4, 2026

When the dataset has a category column with >1 distinct categories and --num_requests N is below the dataset size, take ceil(N / num_categories) rows from each category and round-robin interleave them so any prefix is balanced. Falls back to the existing range(N) slice when category metadata is absent or there's only one category.

Fixes a sampling bug where SPEED-Bench parquet files are sorted by category, so e.g. --num_requests 64 on throughput_8k pulls 64 high_entropy prompts and zero from low_entropy / mixed.

Summary by CodeRabbit

  • New Features
    • SPEEDBench dataset truncation now performs deterministic, balanced stratified sampling across categories (round-robin) when multiple categories exist and a smaller sample size is requested.
    • Falls back to simple prefix selection when no category or only one category is present, preserving prior behavior in that case.

When the dataset has a `category` column with >1 distinct categories
and ``--num_requests N`` is below the dataset size, take ceil(N /
num_categories) rows from each category and round-robin interleave
them so any prefix is balanced. Falls back to the existing
``range(N)`` slice when category metadata is absent or there's only
one category.

Fixes a sampling bug where SPEED-Bench parquet files are sorted by
category, so e.g. ``--num_requests 64`` on throughput_8k pulls 64
high_entropy prompts and zero from low_entropy / mixed.

Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com>
@milesial milesial requested a review from a team as a code owner May 4, 2026 21:49
@milesial milesial requested a review from ChenhanYu May 4, 2026 21:49
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 4, 2026

📝 Walkthrough

Walkthrough

SPEEDBench dataset loading now only truncates when num_samples is set and smaller than the dataset. When truncating, if a multi-valued category column exists it deterministically round-robins rows across categories to collect exactly n samples; otherwise it falls back to selecting the first n rows.

Changes

Stratified Dataset Sampling

Layer / File(s) Summary
Precondition / Truncation Gate
examples/specdec_bench/specdec_bench/datasets/speed.py (around lines 740–743)
Truncation is applied only when self.num_samples is not None and strictly less than len(dataset); previous unconditional prefix truncation was removed.
Core Selection Logic
examples/specdec_bench/specdec_bench/datasets/speed.py (new method lines ~745–773)
Adds SPEEDBench._stratified_select(dataset: "Dataset", n: int) that: if category is absent or contains a single distinct value, returns dataset.select(range(n)); otherwise builds per-category index lists in original order and deterministically collects exactly n rows by round-robin interleaving across categories until n indices are gathered.
Integration
examples/specdec_bench/specdec_bench/datasets/speed.py (caller around lines 740–743)
_load_dataset(...) delegates to self._stratified_select(dataset, self.num_samples) when truncation is needed.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 6
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: implementing stratified sampling of requests across dataset categories in the SPEED-Bench dataset loader.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed Pull request does not introduce any of the six critical security anti-patterns specified in SECURITY.md.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 765-777: The interleaving logic can produce fewer than n items
when some categories have fewer than per_cat rows; update the loop that builds
interleaved (currently using per_cat and cat_samples) to backfill by repeatedly
cycling through cat_samples and appending the next available item from each
category until interleaved has length n or all samples are exhausted, i.e.,
maintain per-category indices (or pop from lists) and in a while
len(interleaved) < n loop iterate over cat_samples appending one item per
non-empty category, then return dataset.select(interleaved[:n]) to guarantee
num_samples.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f07d27e0-5726-4c6e-ae35-bffedcb82b90

📥 Commits

Reviewing files that changed from the base of the PR and between 06ef935 and 9d05c40.

📒 Files selected for processing (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

Comment thread examples/specdec_bench/specdec_bench/datasets/speed.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.29%. Comparing base (06ef935) to head (29cd86f).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1389      +/-   ##
==========================================
+ Coverage   76.90%   77.29%   +0.39%     
==========================================
  Files         471      471              
  Lines       50565    50565              
==========================================
+ Hits        38885    39083     +198     
+ Misses      11680    11482     -198     
Flag Coverage Δ
examples 40.21% <ø> (-0.46%) ⬇️
unit 52.80% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@h-guo18 h-guo18 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at CodeRabbit’s comment above. Other than that LGTM

Replace the two-step (per_cat clamp + interleave) selection with a
single-pass round-robin walk over the per-category index lists.
Smaller categories continue contributing until exhausted; larger
categories then fill the remainder. This guarantees exactly ``n``
rows whenever ``n`` does not exceed the total dataset size, where
the previous code could under-deliver if any category had fewer
than ``ceil(n / num_categories)`` rows.

The first ``n`` rows are still balanced (drop the same ``ceil`` /
``per_cat`` quantity); the only change is what happens past the
exhaustion point of the smallest category. ``math`` is no longer
needed.

Signed-off-by: Alexandre Milesi <milesial@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 745-772: In _stratified_select, handle non-positive n early: if n
<= 0 return dataset.select([]) (or an empty index sequence) before any category
processing so the round-robin loop never appends indices for n==0 or negative n;
locate the function _stratified_select and add this early guard immediately
after the docstring (before checking "category" in dataset.column_names) so all
subsequent logic assumes n > 0.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 38099a4c-3298-464c-901c-7a504580c435

📥 Commits

Reviewing files that changed from the base of the PR and between 9d05c40 and 29cd86f.

📒 Files selected for processing (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

Comment thread examples/specdec_bench/specdec_bench/datasets/speed.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants