Support DP sampling for spec decode#232
Conversation
There was a problem hiding this comment.
💡 Codex Review
tokenspeed/python/tokenspeed/runtime/sampling/backends/flashinfer.py
Lines 522 to 524 in b940507
In DP verify mode, predict is expanded to full-batch shape after gather_verify_outputs, but logits_output.next_token_logits is still this rank's local shard. If enable_output_logprobs is on, write_output_logprobs is called with mismatched row counts, which can fail or compute incorrect logprobs. The logprob write path needs full-batch logits (or a local-only token slice) before this call.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5b8440a7b7
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
b940507 to
16db036
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 16db036e43
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9611a40417
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
9611a40 to
d9c9d7b
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d9c9d7ba91
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Yubo Wang <yubowang2019@gmail.com> # Conflicts: # python/tokenspeed/runtime/execution/model_executor.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 779aacda7d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Do we need cuda_graph_wrapper.py and model_executor.py to be aware of DP sampling? I feel it may be cleaner to keep the DP sampling decision internal to logits_processor.py or the sampling wrappers, rather than exposing it to the higher-level execution path (I'm good to pass flags via controllers but feel like threshold decision etc can be internalized).
There was a problem hiding this comment.
hmm I think we can't determine the graph bucket inside logits processor, because we need 1. forward path padding for other ops like MoE layer + 2. padding due to the bsz % tp_size
There was a problem hiding this comment.
I'm confused:
- What is "forward path padding for other ops like MoE layer"?
- Could "padding due to the bsz % tp_size" be checked inside logits processor?
There was a problem hiding this comment.
The blocker to moving this entirely into LogitsProcessor is that LogitsProcessor runs too late. By then we already needed to know:
DP sampling decision changes the logits/hidden-state communication contract. In current code, DP mode makes LogitsProcessor either slice hidden states before LM head when skip_all_gather is true, or use DP batch-vocab swap instead of normal TP logits all-gather. Non-DP uses the normal TP logits all-gather path.
But the deeper issue is that the decision must be made before logits processing so graph replay, logits layout, and sampling verify all agree.
@LorrinWWW please take a look at the new design to use a logitslayout to represent the decision
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bb33395a62
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b331b59ff5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…-sampling Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
|
@syuoni I think the bigger issue is because LogitsProcessor needs to know whether to run all_gather or let sampling module run the all-to-all depending on dp sampling is enabled or not for this batch. Any suggestion? |
Signed-off-by: Yubo Wang <yubowang2019@gmail.com> # Conflicts: # python/tokenspeed/runtime/sampling/backends/flashinfer.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5ffca77a72
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com> # Conflicts: # python/tokenspeed/runtime/layers/logits_processor.py
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cecc87207d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if forbid_global_logprob_writer: | ||
| import tokenspeed.runtime.sampling.backends.flashinfer as flashinfer_backend | ||
|
|
||
| original_writer = flashinfer_backend.write_output_logprobs |
There was a problem hiding this comment.
Remove the dead writer monkeypatch
In the enable_output_logprobs=True parametrization this branch is always taken, but tokenspeed.runtime.sampling.backends.flashinfer does not define or import write_output_logprobs anymore (checked repo-wide), so the test raises AttributeError here before it ever calls backend.verify. This means the new DP output-logprobs regression test cannot run in CI; either drop this monkeypatch or guard it against the current implementation.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6cfb53fb71
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| all_gather_into_tensor( | ||
| combined_full, | ||
| combined_local, | ||
| self._rank, |
There was a problem hiding this comment.
Pass the TP group, not rank, to all_gather
When the NCCL DP backend is used (either TOKENSPEED_DP_SAMPLING_BACKEND=nccl or auto on systems where the one-sided path is unavailable), this extra positional self._rank is bound as group and the real group is bound as backend, so the explicit backend= keyword raises TypeError: all_gather_into_tensor() got multiple values for argument 'backend' as soon as DP verify gathers outputs. The same call pattern is repeated for DP logprobs below; these calls should pass self._group as the third positional argument and omit the rank.
Useful? React with 👍 / 👎.
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b9c62dbc68
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| dp_vocab_size = int(weight.shape[0]) * ( | ||
| 1 if processor.skip_all_gather else processor.tp_size | ||
| ) |
There was a problem hiding this comment.
Pad replicated DP vocab before configuring comms
In the current patch the skip_all_gather branch now passes the replicated LM-head row count directly, but BaseCausalLM builds that head with config.vocab_size (not a TP-padded size). For TP+DP models whose vocab is not divisible by processor.tp_size, this value is then sent to configure_dp_sampling_vocab_size() and processor.configure_dp_sampling(), both of which construct DP comm helpers that require TP-divisible vocab sizes, so startup fails even though the replicated/full-vocab route does not need a vocab-shard swap. Please round/pad this value for the comm helpers or avoid constructing vocab-swap comm state when skip_all_gather is true.
Useful? React with 👍 / 👎.
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d68b7313b3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4619fb6516
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: dd1dd0a8bc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| dp_sampling_enabled=True, | ||
| dp_num_tokens_per_req=n, |
There was a problem hiding this comment.
Configure the DP logits executor before exercising DP logits
These GPU parity tests call processor._get_logits(..., dp_sampling=True) later, but passing dp_sampling_enabled=True to the constructor only flips the boolean and does not create _logits_layout_executor; the executor is created by configure_dp_sampling(). As written, the non-skip_all_gather DP path raises RuntimeError("dp_sampling logits layout executor is not configured") before the test can compare against legacy output, so the new regression tests cannot pass on a GPU runner.
Useful? React with 👍 / 👎.
Signed-off-by: Yubo Wang <yubowang2019@gmail.com>
Summary
Enable data parallele sampling. This has a few benefits:
Implemented:
Test Plan
Ran Kimi-K2.5 TP8 FP8 KV, batch 40, EAGLE3 spec 5/1/6, ctx=5000, max_tokens=4096, comparing TP sampling baseline vs DP sampling onesided.
DP sampling was about +5.9% on warm rounds 2-3.
Follow up