Skip to content

fix(PD): fix PD speculative bootstrap input seeding#286

Open
XucSh wants to merge 5 commits into
mainfrom
Xuchun/pd-fix
Open

fix(PD): fix PD speculative bootstrap input seeding#286
XucSh wants to merge 5 commits into
mainfrom
Xuchun/pd-fix

Conversation

@XucSh
Copy link
Copy Markdown
Contributor

@XucSh XucSh commented May 28, 2026

Summary

Test Plan

Test with

  1. Qwen3.5-397B-A17B-NVFP4 with MTP
  2. Kimi-K2.5-NVFP4 with EAGLE3

@XucSh XucSh requested a review from a team as a code owner May 28, 2026 01:47
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 564946a630

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +203 to +204
runtime_states.future_input_map[decode_req_pool_indices, 1:] = (
torch.where(mask, ids.expand(-1, tail.shape[1]), tail)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid verifying synthetic bootstrap tokens

When speculative decoding is enabled on a D-rank and a request first transitions from remote prefill, decode_input_ids contains only the verified bootstrap token, but get_candidates() later treats the entire future_input_map row as real draft candidates. Filling every draft slot with that same bootstrap token means the verifier can accept one or more phantom duplicate tokens whenever the target model predicts that id, causing accept_lengths and cache length to advance past tokens that were never drafted. These bootstrap rows need to be forced through a single-token/sample path or otherwise masked so the synthetic tail is never accepted as candidates.

Useful? React with 👍 / 👎.

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e57734a8a4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/input_buffer.py Outdated
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90dee155b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/pd/mooncake/prefill.py
@XucSh XucSh requested a review from tuanzhangCS May 29, 2026 00:56
XucSh added 2 commits May 29, 2026 02:30
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>
# drafter or write_remote_spec_candidate_ids; for retract-recovery
# there is no draft source, so leaving the tail untouched lets the
# verifier reject those slots and the step naturally falls back to
# a 1-token decode. Broadcasting the seed across the tail would
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"naturally falls back to a 1-token decode" isn't strictly guaranteed. future_input_map is torch.empty-initialized (runtime_states.py:52) and never reset on slot reuse, so cols 1..N-1 hold stale ids from a previous request (or uninitialized memory). If the target happens to predict one of those ids, the verifier accepts a phantom draft — same failure shape as the bug this PR fixes, just lower probability.

The PD remote-prefill path is safe because write_remote_spec_candidate_ids fills real candidates. The retract-recovery path (num_extends == 0 with decode_input_ids set) has no candidate source, so leaving the tail untouched is only probabilistically safe.

Consider seeding the tail with -1 (sentinel the verifier always rejects), or forcing this row through a 1-token path.

Comment on lines +944 to +953
req_pool_tensor = torch.tensor(
[req_pool_idx for req_pool_idx, _ in pairs],
dtype=torch.int64,
device=self.device,
)
mamba_tensor = torch.tensor(
[mamba_idx for _, mamba_idx in pairs],
dtype=torch.int64,
device=self.device,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two synchronous H2D allocations in this PR:

  1. Here (L944-953): both torch.tensor(..., device=self.device) calls block. This function also runs on the default stream (no execution_stream context), so the stall hits the main loop directly.

  2. runtime_states.py:85 in write_remote_spec_candidate_ids — same pattern. The outer stream wrapper at model_executor.py:1369 only affects kernel scheduling; torch.tensor(..., device=cuda) itself still blocks regardless of stream context.

Compare to reset_valid_cache_length (L994) and the new write_decode_input_ids helper, which both use the project-standard pattern:

torch.tensor(..., dtype=..., device="cpu", pin_memory=True).to(self.device, non_blocking=True)

Suggest matching that pattern in both spots.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants