fix(PD): fix PD speculative bootstrap input seeding#286
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 564946a630
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| runtime_states.future_input_map[decode_req_pool_indices, 1:] = ( | ||
| torch.where(mask, ids.expand(-1, tail.shape[1]), tail) |
There was a problem hiding this comment.
Avoid verifying synthetic bootstrap tokens
When speculative decoding is enabled on a D-rank and a request first transitions from remote prefill, decode_input_ids contains only the verified bootstrap token, but get_candidates() later treats the entire future_input_map row as real draft candidates. Filling every draft slot with that same bootstrap token means the verifier can accept one or more phantom duplicate tokens whenever the target model predicts that id, causing accept_lengths and cache length to advance past tokens that were never drafted. These bootstrap rows need to be forced through a single-token/sample path or otherwise masked so the synthetic tail is never accepted as candidates.
Useful? React with 👍 / 👎.
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e57734a8a4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 90dee155b6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
| # drafter or write_remote_spec_candidate_ids; for retract-recovery | ||
| # there is no draft source, so leaving the tail untouched lets the | ||
| # verifier reject those slots and the step naturally falls back to | ||
| # a 1-token decode. Broadcasting the seed across the tail would |
There was a problem hiding this comment.
"naturally falls back to a 1-token decode" isn't strictly guaranteed. future_input_map is torch.empty-initialized (runtime_states.py:52) and never reset on slot reuse, so cols 1..N-1 hold stale ids from a previous request (or uninitialized memory). If the target happens to predict one of those ids, the verifier accepts a phantom draft — same failure shape as the bug this PR fixes, just lower probability.
The PD remote-prefill path is safe because write_remote_spec_candidate_ids fills real candidates. The retract-recovery path (num_extends == 0 with decode_input_ids set) has no candidate source, so leaving the tail untouched is only probabilistically safe.
Consider seeding the tail with -1 (sentinel the verifier always rejects), or forcing this row through a 1-token path.
| req_pool_tensor = torch.tensor( | ||
| [req_pool_idx for req_pool_idx, _ in pairs], | ||
| dtype=torch.int64, | ||
| device=self.device, | ||
| ) | ||
| mamba_tensor = torch.tensor( | ||
| [mamba_idx for _, mamba_idx in pairs], | ||
| dtype=torch.int64, | ||
| device=self.device, | ||
| ) |
There was a problem hiding this comment.
Two synchronous H2D allocations in this PR:
-
Here (L944-953): both
torch.tensor(..., device=self.device)calls block. This function also runs on the default stream (noexecution_streamcontext), so the stall hits the main loop directly. -
runtime_states.py:85inwrite_remote_spec_candidate_ids— same pattern. The outer stream wrapper atmodel_executor.py:1369only affects kernel scheduling;torch.tensor(..., device=cuda)itself still blocks regardless of stream context.
Compare to reset_valid_cache_length (L994) and the new write_decode_input_ids helper, which both use the project-standard pattern:
torch.tensor(..., dtype=..., device="cpu", pin_memory=True).to(self.device, non_blocking=True)Suggest matching that pattern in both spots.
Summary
Test Plan
Test with