perf(Spec Decode): skip dead-position compute in draft catch-up step (decode) #217
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cacabd6a6f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Force-pushed from cacabd6 to ac042cd
💡 Codex Review
Reviewed commit: ac042cd43b
Force-pushed from ac042cd to d410fa6
💡 Codex Review
Reviewed commit: d410fa6417
This is actually @woodyji's idea, so looping him in as well. Another related idea is to skip
Signed-off-by: rjzhb <rjzhb222@163.com>
Force-pushed from d410fa6 to f3d467e
💡 Codex Review
Reviewed commit: a0df4e37ed
💡 Codex Review
Reviewed commit: fae72911fd
💡 Codex Review
Reviewed commit: e26faabf55
I tested this EAGLE first-step reduce path against Kimi K2.6 NVFP4 + EAGLE3 +
That follow-up branch is here: https://github.com/mesaleh/tokenspeed/tree/followup/pr217-kimi-mla-reduce
I did not open it as a direct PR to
I also opened two independent PRs from the same validation work that do not depend on or overlap with this branch:
Validation for the combined local fix set: Kimi K2.6 NVFP4 + EAGLE3 +
Feel free to ignore the timeout in ut-runtime-1gpu / linux-mi355-1gpu-lightseek.
This may be a bit of an overclaim. The benefit likely appears only when the batch size is large enough, since the change mainly saves compute while the data-movement overhead stays the same.
We need a decode-heavy workload with short prompts and large batches to run the experiment, since this change mainly targets decode. With a general workload, the difference from baseline is barely visible.
💡 Codex Review
Reviewed commit: e0f2e685c5
…comments Signed-off-by: rjzhb <rjzhb222@163.com>
… reduce Signed-off-by: rjzhb <rjzhb222@163.com>
💡 Codex Review
Reviewed commit: db82f61222
I'm interested in the improvement here. Could you please use this script to test the performance comparison between the
Thanks.
…lreduce Signed-off-by: rjzhb <rjzhb222@163.com>
Now testing agentic_bench.sh
@lightseek-bot @yweng0828 I ran the Kimi-K2.5-NVFP4 agentic bench with the default settings. My read is that the bench is too small to expose this optimization. With max-num-seqs=16, the catch-up step is at most 16 * 4 = 64 rows, and that step is only a small part of the full spec cycle, so the expected end-to-end gain is probably about the same size as the benchmark variance. Also, from nsys: the index_select that the reduce path runs in each catch-up step is ~0.3% of total kernel time.
The two failed cases seem to be a machine issue. Any remaining concerns? @yweng0828 @syuoni @zhyncs
I'm going to merge since there is no regression. We can patch anytime if needed.
Summary
Spec-decode draft head's catch-up first step takes a padded `bs*spec_num_tokens` input, but only one logit per request is sampled; the other `bs*(spec_num_tokens-1)` rows exist purely to write K/V into the KV cache for the next spec round. With `lookahead=3` (4 tokens per request) that's 75% of the input rows whose attn / O-proj / MLP / norm outputs are computed and then immediately discarded at the LogitsProcessor's last-position slice.

This PR moves that slice inside the layer, right after KV write: instead of doing dead work on `[bs*spec_num_tokens, H]` and pruning at the end, we prune to `[bs, H]` between KV write and the post-attn ops, then run o_proj / MLP / norms on the `bs` live rows only. For the LLaMa Eagle3 head with a prewrite-capable attention backend, the slice also fires before attention so the decode kernel sees `q_len_per_req=1` × full cache, saving the attention compute too.

What's saved per layer (live rows / total rows = `1/spec_num_tokens`):
- `bs` queries instead of `bs*spec_num_tokens`, only on prewrite-capable backends (today: LLaMa Eagle3 head's `LlamaAttention`)
- `bs` rows
- `bs` rows

Scope
Covers all spec-decode draft head classes that go through `drafter/eagle.py`:
- `LlamaForCausalLMEagle3`: MHA EAGLE3 (`models/llama_eagle3.py`); both pre-attn Q-slice (decode kernel switch) and post-attn slice
- `Eagle3DeepseekV2ForCausalLM`: MLA EAGLE3 (`models/deepseek_v3.py`); post-attn slice in `DeepseekV3AttentionMLA.forward` + residual slice in `Eagle3MlaDecoderLayer`
- `DeepseekV3ForCausalLMNextN`: DeepSeek-V3 MTP/NextN (`models/deepseek_nextn.py`); residual slice in shared `DeepseekV3DecoderLayer` + `comm_manager.final_norm` delegation for fused-allreduce contract
- `Qwen3_5ForConditionalGenerationNextN` + `Qwen3_5MoeForConditionalGenerationNextN`: Qwen3.5 MTP (`models/qwen3_5_nextn.py`); attn-output + residual slice in shared `Qwen3_5AttentionDecoderLayer`

End-to-end sim
MiniMax-M2.5 + `thoughtworks/MiniMax-M2.5-Eagle3` head, B200 TP=2, reasoning-style workload (8K prompt / 3K gen, QPS=0.3, 300s sustain):
Metrics: `gen_tps` (Loaded), `inflight_mean`, `mean_accept_len`
Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.
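For readers unfamiliar with the row pruning described in the Summary, here is a minimal pure-Python sketch of the idea. It is not the PR's code. It assumes the padded input is laid out request-major (each request owns `spec_num_tokens` consecutive rows) and that the live row is the last position of each request, as the LogitsProcessor's last-position slice suggests. The function names are hypothetical and exist only for illustration.

```python
def last_position_indices(bs, spec_num_tokens):
    """Row index of the last position of each request in the padded batch.

    Assumption (not confirmed by the PR): rows are request-major, so request r
    owns rows [r * spec_num_tokens, (r + 1) * spec_num_tokens).
    """
    return [r * spec_num_tokens + (spec_num_tokens - 1) for r in range(bs)]


def prune_to_live_rows(hidden, bs, spec_num_tokens):
    """Keep only the bs live rows of a [bs * spec_num_tokens, H] list of rows.

    In the PR this selection happens after the KV write, so the dropped rows
    have already contributed their K/V to the cache.
    """
    assert len(hidden) == bs * spec_num_tokens
    return [hidden[i] for i in last_position_indices(bs, spec_num_tokens)]


def dead_row_fraction(spec_num_tokens):
    """Fraction of input rows whose post-attention outputs would be discarded."""
    return (spec_num_tokens - 1) / spec_num_tokens


# lookahead=3 means 4 tokens per request; with bs=2 the padded input has 8 rows.
hidden = [[row] for row in range(8)]  # stand-in for an [8, H] tensor with H=1
live = prune_to_live_rows(hidden, bs=2, spec_num_tokens=4)
print(live)                  # rows 3 and 7 survive
print(dead_row_fraction(4))  # 0.75, the 75% quoted in the Summary
```

On a real tensor the same selection would be a single gather along the row dimension; the `index_select` mentioned in the conversation appears to play that role, though its exact call site is not shown here.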
Test plan