Skip to content

Support DP sampling for spec decode#232

Open
yubofredwang wants to merge 41 commits into
mainfrom
ywang/dp-sampling
Open

Support DP sampling for spec decode#232
yubofredwang wants to merge 41 commits into
mainfrom
ywang/dp-sampling

Conversation

@yubofredwang
Copy link
Copy Markdown
Contributor

@yubofredwang yubofredwang commented May 24, 2026

Summary

Enable data parallele sampling. This has a few benefits:

  1. No need to use deterministic kernel during larger batch. Potentially better perf.
  2. Reduce the all gather communication overhead by calculating needed part and use all to all single operation.
  3. Avoid duplicate work from ranks in TP mode. No need to sync verify results between ranks, but rather, gather the results.

Implemented:

  • swap vocab and batch dimension
  • onesided all2all kernel
  • padding for cuda graph and even split
  • Support logprobs
  • Support xgrammar

Test Plan

Ran Kimi-K2.5 TP8 FP8 KV, batch 40, EAGLE3 spec 5/1/6, ctx=5000, max_tokens=4096, comparing TP sampling baseline vs DP sampling onesided.

mode | round 1 tok/s | round 2 tok/s | round 3 tok/s | avg rounds 2-3
-- | -- | -- | -- | --
TP sampling | 4475.0 | 5227.5 | 4956.8 | 5092.2
DP sampling | 4333.1 | 5543.3 | 5238.8 | 5391.1

DP sampling was about +5.9% on warm rounds 2-3.

Run | Eval batch | DP gate | Route coverage | AIME score
-- | -- | -- | -- | --
Initial AIME | 16 | default min_bs=16 | bs=16 used DP sampling; smaller tail batches fell back | 0.9667
Large batch AIME | 30 | default min_bs=16 | bs=30 through bs=16 used DP sampling; bs<16 fell back | 0.9000
Full DP AIME | 30 | min_bs=1 | all decode batch sizes from bs=30 down to bs=1 used DP sampling | 0.9333

Follow up

  • Support better open threshold. Things that determine the minimum bsz to enable dp sampling are:
  1. Batch size
  2. Vocab size
  3. TP size
  4. spec draft number
  5. DP attention enabled

@yubofredwang yubofredwang requested a review from a team as a code owner May 24, 2026 01:42
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

write_output_logprobs(
logits_output, logits_output.next_token_logits, predict
)

P1 Badge Keep logits and token shapes aligned for output logprobs

In DP verify mode, predict is expanded to full-batch shape after gather_verify_outputs, but logits_output.next_token_logits is still this rank's local shard. If enable_output_logprobs is on, write_output_logprobs is called with mismatched row counts, which can fail or compute incorrect logprobs. The logprob write path needs full-batch logits (or a local-only token slice) before this call.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/sampling/backends/flashinfer.py Outdated
@yubofredwang yubofredwang requested review from syuoni and yweng0828 May 24, 2026 06:55
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5b8440a7b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/sampling/backends/flashinfer.py
Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
@yubofredwang yubofredwang force-pushed the ywang/dp-sampling branch 2 times, most recently from b940507 to 16db036 Compare May 24, 2026 08:29
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 16db036e43

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/model_executor.py Outdated
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9611a40417

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/sampling/backends/flashinfer.py Outdated
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d9c9d7ba91

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/communication/dp_sampling.py Outdated
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

# Conflicts:
#	python/tokenspeed/runtime/execution/model_executor.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 779aacda7d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/model_executor.py Outdated
Comment thread python/tokenspeed/runtime/execution/context.py Outdated
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need cuda_graph_wrapper.py and model_executor.py to be aware of DP sampling? I feel it may be cleaner to keep the DP sampling decision internal to logits_processor.py or the sampling wrappers, rather than exposing it to the higher-level execution path (I'm good to pass flags via controllers but feel like threshold decision etc can be internalized).

Copy link
Copy Markdown
Contributor Author

@yubofredwang yubofredwang May 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I think we can't determine the graph bucket inside logits processor, because we need 1. forward path padding for other ops like MoE layer + 2. padding due to the bsz % tp_size

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused:

  1. What is "forward path padding for other ops like MoE layer"?
  2. Could "padding due to the bsz % tp_size" be checked inside logits processor?

Copy link
Copy Markdown
Contributor Author

@yubofredwang yubofredwang Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocker to moving this entirely into LogitsProcessor is that LogitsProcessor runs too late. By then we already needed to know:

DP sampling decision changes the logits/hidden-state communication contract. In current code, DP mode makes LogitsProcessor either slice hidden states before LM head when skip_all_gather is true, or use DP batch-vocab swap instead of normal TP logits all-gather. Non-DP uses the normal TP logits all-gather path.

But the deeper issue is that the decision must be made before logits processing so graph replay, logits layout, and sampling verify all agree.
@LorrinWWW please take a look at the new design to use a logitslayout to represent the decision

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bb33395a62

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/distributed/dp_sampling_comm.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b331b59ff5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/sampling/backends/flashinfer.py Outdated
…-sampling

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/communication/dp_sampling.py Outdated
@yubofredwang
Copy link
Copy Markdown
Contributor Author

@syuoni I think the bigger issue is because LogitsProcessor needs to know whether to run all_gather or let sampling module run the all-to-all depending on dp sampling is enabled or not for this batch. Any suggestion?

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

# Conflicts:
#	python/tokenspeed/runtime/sampling/backends/flashinfer.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5ffca77a72

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>

# Conflicts:
#	python/tokenspeed/runtime/layers/logits_processor.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cecc87207d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

if forbid_global_logprob_writer:
import tokenspeed.runtime.sampling.backends.flashinfer as flashinfer_backend

original_writer = flashinfer_backend.write_output_logprobs
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove the dead writer monkeypatch

In the enable_output_logprobs=True parametrization this branch is always taken, but tokenspeed.runtime.sampling.backends.flashinfer does not define or import write_output_logprobs anymore (checked repo-wide), so the test raises AttributeError here before it ever calls backend.verify. This means the new DP output-logprobs regression test cannot run in CI; either drop this monkeypatch or guard it against the current implementation.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6cfb53fb71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

all_gather_into_tensor(
combined_full,
combined_local,
self._rank,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pass the TP group, not rank, to all_gather

When the NCCL DP backend is used (either TOKENSPEED_DP_SAMPLING_BACKEND=nccl or auto on systems where the one-sided path is unavailable), this extra positional self._rank is bound as group and the real group is bound as backend, so the explicit backend= keyword raises TypeError: all_gather_into_tensor() got multiple values for argument 'backend' as soon as DP verify gathers outputs. The same call pattern is repeated for DP logprobs below; these calls should pass self._group as the third positional argument and omit the rank.

Useful? React with 👍 / 👎.

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9c62dbc68

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +332 to +334
dp_vocab_size = int(weight.shape[0]) * (
1 if processor.skip_all_gather else processor.tp_size
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pad replicated DP vocab before configuring comms

In the current patch the skip_all_gather branch now passes the replicated LM-head row count directly, but BaseCausalLM builds that head with config.vocab_size (not a TP-padded size). For TP+DP models whose vocab is not divisible by processor.tp_size, this value is then sent to configure_dp_sampling_vocab_size() and processor.configure_dp_sampling(), both of which construct DP comm helpers that require TP-divisible vocab sizes, so startup fails even though the replicated/full-vocab route does not need a vocab-shard swap. Please round/pad this value for the comm helpers or avoid constructing vocab-swap comm state when skip_all_gather is true.

Useful? React with 👍 / 👎.

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d68b7313b3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/layers/logits_processor.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4619fb6516

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/cuda_graph_wrapper.py Outdated
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dd1dd0a8bc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +173 to +174
dp_sampling_enabled=True,
dp_num_tokens_per_req=n,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Configure the DP logits executor before exercising DP logits

These GPU parity tests call processor._get_logits(..., dp_sampling=True) later, but passing dp_sampling_enabled=True to the constructor only flips the boolean and does not create _logits_layout_executor; the executor is created by configure_dp_sampling(). As written, the non-skip_all_gather DP path raises RuntimeError("dp_sampling logits layout executor is not configured") before the test can compare against legacy output, so the new regression tests cannot pass on a GPU runner.

Useful? React with 👍 / 👎.

Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants