fix(PD): fix PD speculative bootstrap input seeding by XucSh · Pull Request #286 · lightseekorg/tokenspeed

XucSh · 2026-05-28T01:47:17Z

Summary

Test Plan

Test with

Qwen3.5-397B-A17B-NVFP4 with MTP
Kimi-K2.5-NVFP4 with EAGLE3

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 564946a630

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T01:49:46Z

+                runtime_states.future_input_map[decode_req_pool_indices, 1:] = (
+                    torch.where(mask, ids.expand(-1, tail.shape[1]), tail)


Avoid verifying synthetic bootstrap tokens

When speculative decoding is enabled on a D-rank and a request first transitions from remote prefill, decode_input_ids contains only the verified bootstrap token, but get_candidates() later treats the entire future_input_map row as real draft candidates. Filling every draft slot with that same bootstrap token means the verifier can accept one or more phantom duplicate tokens whenever the target model predicts that id, causing accept_lengths and cache length to advance past tokens that were never drafted. These bootstrap rows need to be forced through a single-token/sample path or otherwise masked so the synthetic tail is never accepted as candidates.

Useful? React with 👍 / 👎.

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e57734a8a4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 90dee155b6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

tuanzhangCS · 2026-05-29T10:14:19Z

+            # drafter or write_remote_spec_candidate_ids; for retract-recovery
+            # there is no draft source, so leaving the tail untouched lets the
+            # verifier reject those slots and the step naturally falls back to
+            # a 1-token decode. Broadcasting the seed across the tail would


"naturally falls back to a 1-token decode" isn't strictly guaranteed. future_input_map is torch.empty-initialized (runtime_states.py:52) and never reset on slot reuse, so cols 1..N-1 hold stale ids from a previous request (or uninitialized memory). If the target happens to predict one of those ids, the verifier accepts a phantom draft — same failure shape as the bug this PR fixes, just lower probability.

The PD remote-prefill path is safe because write_remote_spec_candidate_ids fills real candidates. The retract-recovery path (num_extends == 0 with decode_input_ids set) has no candidate source, so leaving the tail untouched is only probabilistically safe.

Consider seeding the tail with -1 (sentinel the verifier always rejects), or forcing this row through a 1-token path.

tuanzhangCS · 2026-05-29T10:23:28Z

+        req_pool_tensor = torch.tensor(
+            [req_pool_idx for req_pool_idx, _ in pairs],
+            dtype=torch.int64,
+            device=self.device,
+        )
+        mamba_tensor = torch.tensor(
+            [mamba_idx for _, mamba_idx in pairs],
+            dtype=torch.int64,
+            device=self.device,
+        )


Two synchronous H2D allocations in this PR:

Here (L944-953): both torch.tensor(..., device=self.device) calls block. This function also runs on the default stream (no execution_stream context), so the stall hits the main loop directly.

runtime_states.py:85 in write_remote_spec_candidate_ids — same pattern. The outer stream wrapper at model_executor.py:1369 only affects kernel scheduling; torch.tensor(..., device=cuda) itself still blocks regardless of stream context.

Compare to reset_valid_cache_length (L994) and the new write_decode_input_ids helper, which both use the project-standard pattern:

torch.tensor(..., dtype=..., device="cpu", pin_memory=True).to(self.device, non_blocking=True)

Suggest matching that pattern in both spots.

XucSh requested a review from a team as a code owner May 28, 2026 01:47

chatgpt-codex-connector Bot reviewed May 28, 2026

View reviewed changes

fix(pd): spec compatible

f9fb97f

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

XucSh force-pushed the Xuchun/pd-fix branch from 72a3757 to f9fb97f Compare May 28, 2026 09:03

fix lint

e57734a

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

chatgpt-codex-connector Bot reviewed May 28, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/execution/input_buffer.py Outdated

fix qwen

90dee15

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

chatgpt-codex-connector Bot reviewed May 29, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/pd/mooncake/prefill.py

XucSh requested a review from tuanzhangCS May 29, 2026 00:56

XucSh added 2 commits May 29, 2026 02:30

fix spec tail when retract

c9176fb

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

fix(pd): clean up per-request state

ccde288

Signed-off-by: Xuchun Shang <xuchun.shang@gmail.com>

tuanzhangCS reviewed May 29, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(PD): fix PD speculative bootstrap input seeding#286

fix(PD): fix PD speculative bootstrap input seeding#286
XucSh wants to merge 5 commits into
mainfrom
Xuchun/pd-fix

XucSh commented May 28, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Uh oh!

tuanzhangCS May 29, 2026

Uh oh!

tuanzhangCS May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		runtime_states.future_input_map[decode_req_pool_indices, 1:] = (
		torch.where(mask, ids.expand(-1, tail.shape[1]), tail)

Conversation

XucSh commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test Plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

tuanzhangCS May 29, 2026

Choose a reason for hiding this comment

Uh oh!

tuanzhangCS May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

XucSh commented May 28, 2026 •

edited

Loading