perf(Spec Decode): skip dead-position compute in draft catch-up step(decode) by rjzhb · Pull Request #217 · lightseekorg/tokenspeed

rjzhb · 2026-05-22T20:58:10Z

Summary

The spec-decode draft head's first catch-up step takes a padded bs*spec_num_tokens input, but only one logit per request is sampled; the other bs*(spec_num_tokens-1) rows exist purely to write K/V into the KV cache for the next spec round. With lookahead=3 (4 tokens per request), 75% of the input rows have their attn / O-proj / MLP / norm outputs computed and then immediately discarded at the LogitsProcessor's last-position slice.
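As a quick check of the 75% figure, here is a minimal sketch of the row accounting. The helper name `catch_up_rows` and the batch size of 8 are illustrative assumptions, not taken from the PR.

```python
from typing import Tuple


def catch_up_rows(bs: int, spec_num_tokens: int) -> Tuple[int, int]:
    """Return (live_rows, dead_rows) for one padded catch-up step."""
    total = bs * spec_num_tokens
    live = bs  # only one logit per request is sampled
    return live, total - live


lookahead = 3
spec_num_tokens = lookahead + 1  # 4 tokens per request, as in the summary
live, dead = catch_up_rows(bs=8, spec_num_tokens=spec_num_tokens)  # bs=8 is hypothetical
dead_fraction = dead / (live + dead)
print(live, dead, dead_fraction)  # 8 24 0.75
```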

This PR moves that slice inside the layer, right after the KV write. Instead of doing dead work on [bs*spec_num_tokens, H] and pruning at the end, we prune to [bs, H] between the KV write and the post-attn ops, then run o_proj / MLP / norms on the bs live rows only. For the Llama Eagle3 head with a prewrite-capable attention backend, the slice also fires before attention, so the decode kernel sees q_len_per_req=1 × full cache, saving the attention compute too.

What's saved per layer (live rows / total rows = 1/spec_num_tokens):

QKV proj — still full (K/V must be written for the next round)
Attn (Q·K^T + softmax·V) — bs queries instead of bs*spec_num_tokens, only on prewrite-capable backends (today: LLaMa Eagle3 head's LlamaAttention)
O-proj — bs rows
post-attn norm / residual / MLP / post-mlp norm / final norm — bs rows
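The ordering described above (write K/V for every row, then prune, then run the post-attention ops) can be sketched in plain Python. This is a toy model, not the PR's code: `draft_layer_first_step`, the list-based "KV cache", and the `post_attn` callback are hypothetical stand-ins, and the live row is assumed to be the last padded position of each request purely for illustration.

```python
from typing import Callable, List

Row = List[float]


def draft_layer_first_step(
    hidden: List[Row],       # [bs * spec_num_tokens, H] padded catch-up input
    gather_ids: List[int],   # one live row index per request
    kv_cache: List[Row],     # stand-in for the real KV cache
    post_attn: Callable[[Row], Row],
) -> List[Row]:
    # The KV write still covers every row: the next spec round reads them.
    kv_cache.extend(list(row) for row in hidden)
    # Prune to the live rows right after the KV write.
    live = [hidden[i] for i in gather_ids]
    # o_proj / MLP / norms (collapsed into post_attn here) see bs rows only.
    return [post_attn(row) for row in live]


bs, spec_num_tokens = 2, 4
hidden = [[float(i)] for i in range(bs * spec_num_tokens)]
# Assumption for this toy: the live row is each request's last padded position.
gather_ids = [r * spec_num_tokens + (spec_num_tokens - 1) for r in range(bs)]
kv_cache: List[Row] = []
calls: List[Row] = []


def post_attn(row: Row) -> Row:
    calls.append(row)  # count how many rows reach the post-attn ops
    return [2.0 * x for x in row]


out = draft_layer_first_step(hidden, gather_ids, kv_cache, post_attn)
print(len(kv_cache), len(calls), out)  # 8 2 [[6.0], [14.0]]
```

All 8 rows reach the cache, while only 2 rows pass through the post-attention work.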

Scope

Covers all spec-decode draft head classes that go through drafter/eagle.py:

LlamaForCausalLMEagle3 — MHA EAGLE3 (models/llama_eagle3.py); both pre-attn Q-slice (decode kernel switch) and post-attn slice
Eagle3DeepseekV2ForCausalLM — MLA EAGLE3 (models/deepseek_v3.py); post-attn slice in DeepseekV3AttentionMLA.forward + residual slice in Eagle3MlaDecoderLayer
DeepseekV3ForCausalLMNextN — DeepSeek-V3 MTP/NextN (models/deepseek_nextn.py); residual slice in shared DeepseekV3DecoderLayer + comm_manager.final_norm delegation for fused-allreduce contract
Qwen3_5ForConditionalGenerationNextN + Qwen3_5MoeForConditionalGenerationNextN — Qwen3.5 MTP (models/qwen3_5_nextn.py); attn-output + residual slice in shared Qwen3_5AttentionDecoderLayer

End-to-end sim

MiniMax-M2.5 + thoughtworks/MiniMax-M2.5-Eagle3 head, B200 TP=2, reasoning-style workload (8K prompt / 3K gen, QPS=0.3, 300s sustain):

| Metric | opt #1 | opt #2 | baseline |
|---|---|---|---|
| `gen_tps` (Loaded) | 32.0 | 34.9 | 30.2 |
| `inflight_mean` | 26.7 | 22.7 | 25.1 |
| `mean_accept_len` | 2.02 | 1.92 | 1.94 |

Metric notes (measured over the steady-state window after warm-up)

gen_tps: system-wide decode tokens per second. The throughput metric this optimization targets.

inflight_mean: average number of concurrent requests being processed — i.e. the effective batch size. The optimization's gain scales with this number.

mean_accept_len: average accepted tokens per spec round. Close values across runs confirm spec decode behavior is unchanged, so the comparison is apples-to-apples.

Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.

Test plan

End-to-end sim A/B on MiniMax-M2.5 + EAGLE3 (above)
Additional A/B accuracy + bench data on Kimi-K2.5-NVFP4 (MLA path) — see comments below

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cacabd6a6f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ac042cd43b

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d410fa6417

LorrinWWW · 2026-05-25T05:32:25Z

This is actually @woodyji's idea, so looping him in as well. Another related idea is to skip N-1 prefill tokens in the last layer.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0df4e37ed

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fae72911fd

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e26faabf55

mesaleh · 2026-05-26T09:50:46Z

I tested this EAGLE first-step reduce path against Kimi K2.6 NVFP4 + EAGLE3 + trtllm_mla on a local GB200 node. The direction here is right; after applying this branch, I still needed one Kimi/MLA-specific follow-up to move the ctx.gather_ids reduction before the MLA value projection and return the reduced buffer to o_proj.

That follow-up branch is here:

https://github.com/mesaleh/tokenspeed/tree/followup/pr217-kimi-mla-reduce

I did not open it as a direct PR to main because it is intentionally stacked on this PR and would otherwise duplicate/conflict with the EAGLE plumbing here.

I also opened two independent PRs from the same validation work that do not depend on or overlap with this branch:

fix(trtllm-mla): make spec-decode CUDA graph capture causal #260 fixes trtllm_mla spec-decode CUDA graph capture metadata and multi-token causal decode bounds.
fix(deepseek): guard missing quant weight_block_size #261 guards DeepSeek/Kimi quant config loading when weight_block_size is not exposed.

Validation for the combined local fix set: Kimi K2.6 NVFP4 + EAGLE3 + trtllm_mla captured CUDA graph batch sizes [1, 2, 3, 4, 5, 6, 7, 8], reached healthy readiness, and completed a 10/10 OpenAI-compatible synthetic benchmark run.

lightseek-bot · 2026-05-27T03:32:36Z

@rjzhb could you run this: https://github.com/lightseekorg/tokenspeed/blob/main/test/agentic_benchmark/tokenspeed/agentic_bench.sh

borontion · 2026-05-27T03:46:45Z

@rjzhb watch the mi355 test; we can merge if it passes.

Feel free to ignore the timeout in ut-runtime-1gpu / linux-mi355-1gpu-lightseek.

LorrinWWW · 2026-05-27T03:54:48Z

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.

Are you sure, @rjzhb?

This may be a bit of an overclaim. The benefit likely only appears when the batch size is large enough, since it mainly saves compute while the data movement overhead remains the same.
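The compute-versus-data-movement point can be illustrated with a toy roofline estimate for a single projection GEMM. Everything numeric here is an assumption chosen for illustration (hidden size 4096, 2-byte elements, 8e12 B/s memory bandwidth, 2e15 FLOP/s peak); none of it is measured on the hardware in this PR, and the helper names are hypothetical.

```python
def gemm_time_s(
    rows: int,
    hidden: int = 4096,          # assumed hidden size
    bytes_per_elem: int = 2,     # assumed 16-bit weights and activations
    mem_bw: float = 8e12,        # assumed memory bandwidth, bytes/s
    peak_flops: float = 2e15,    # assumed peak throughput, FLOP/s
) -> float:
    """Toy roofline: time is the larger of the memory and compute estimates."""
    weight_bytes = hidden * hidden * bytes_per_elem
    act_bytes = 2 * rows * hidden * bytes_per_elem  # read input + write output
    t_mem = (weight_bytes + act_bytes) / mem_bw
    t_compute = 2 * rows * hidden * hidden / peak_flops
    return max(t_mem, t_compute)


def reduce_speedup(bs: int, spec_num_tokens: int = 4) -> float:
    """Estimated GEMM speedup from running bs rows instead of bs * spec_num_tokens."""
    return gemm_time_s(bs * spec_num_tokens) / gemm_time_s(bs)


for bs in (16, 64, 512):
    print(bs, round(reduce_speedup(bs), 2))
```

Under these assumptions the small-batch case stays weight-bandwidth bound, so dropping rows barely changes the estimate, while the large-batch case is compute bound and approaches the full spec_num_tokens factor.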

rjzhb · 2026-05-27T04:01:42Z

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.
>
> are you sure @rjzhb

This needs a decode-heavy workload with short prompts and large batches to test properly, since this change mainly targets decode. With a general workload, the difference from baseline is barely visible.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e0f2e685c5

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: db82f61222

yweng0828 · 2026-05-27T07:49:52Z

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.

Thank you for the optimization. It's great work.

I'm interested in the improvement here. Could you please use this script to compare performance with and without this optimization? This optimization will definitely bring a performance improvement; it would be good to have more performance data.

Thanks.

rjzhb · 2026-05-27T18:27:56Z

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.
> Thank you for your optimization. It's a great work.
>
> I'm interested in the improvement here. Could you please use this script to test the performance comparison between the w/ and w/o this optimization? This optimization will definitely bring performance improvement. It would be better if we could know more performance data.
>
> Thanks.

rjzhb · 2026-05-27T19:04:35Z

Now testing agentic_bench.sh

rjzhb · 2026-05-28T00:05:32Z

@lightseek-bot @yweng0828 I ran the Kimi-K2.5-NVFP4 agentic bench with the default settings.
At this scale, I don't see a clear measurable win from reduce ON. The ON/OFF numbers are mostly within normal run-to-run noise: latency and throughput move by around 1–3%, the cache hit rate is basically unchanged, and decoded tok/iter is also very close.

My read is that the bench is too small to expose this optimization. With max-num-seqs=16, the catch-up step is at most 16 * 4 = 64 rows, and that step is only a small part of the full spec cycle. So the expected end-to-end gain is probably about the same size as the benchmark variance.
Also, this MLA path only has the post-attn slice optimization, not the Q-slice + decode-kernel switch that showed a bigger gain in the MiniMax MHA sim. So I think the current result is expected.
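The "small part of the full spec cycle" argument is an Amdahl's-law bound, sketched below. The catch-up share of cycle time (`catch_up_fraction=0.05`) and the per-step speedup of 4x are assumed placeholder values, not measurements from this bench.

```python
def end_to_end_speedup(catch_up_fraction: float, step_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one part of the cycle gets faster."""
    return 1.0 / ((1.0 - catch_up_fraction) + catch_up_fraction / step_speedup)


max_num_seqs, spec_num_tokens = 16, 4
catch_up_rows = max_num_seqs * spec_num_tokens  # 64 rows at most, as stated above

# Hypothetical inputs: catch-up is 5% of the cycle and gets the ideal 4x speedup.
overall = end_to_end_speedup(catch_up_fraction=0.05, step_speedup=4.0)
print(catch_up_rows, round(overall, 3))
```

Even with an ideal per-step speedup, a step that takes a few percent of the cycle can only move end-to-end throughput by a few percent, which is comparable to the 1–3% run-to-run spread reported here.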

One more data point from nsys: the index_select that the reduce path runs in each catch-up step is ~0.3% of total kernel time.

## attn_tp4_moe_tp4 (B200)

### A (reduce ON)

| Conc | Latency (tps/user) | Throughput (tps/gpu) | Approx Cache Hit | Decoded Tok/Iter |
|---|---|---|---|---|
| 1  | 420.17 | 9072.30  | 90.93 | 3.3885 |
| 2  | 296.74 | 12925.61 | 91.08 | 3.3609 |
| 4  | 181.16 | 16898.29 | 90.24 | 3.1473 |
| 8  | 110.74 | 20717.89 | 90.09 | 3.3092 |
| 16 | 71.84  | 23722.54 | 90.56 | 3.3221 |

### B (reduce OFF)

| Conc | Latency (tps/user) | Throughput (tps/gpu) | Approx Cache Hit | Decoded Tok/Iter |
|---|---|---|---|---|
| 1  | 416.67 | 9151.52  | 90.93 | 3.3885 |
| 2  | 289.86 | 12945.82 | 91.08 | 3.2896 |
| 4  | 184.16 | 17025.78 | 90.24 | 3.1558 |
| 8  | 113.38 | 21179.63 | 90.10 | 3.3664 |
| 16 | 69.64  | 24443.96 | 90.56 | 3.3526 |
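To make the ON/OFF comparison easier to read, the relative deltas can be computed directly from the two tables above. This is a small helper sketch; the numbers are copied from the tables, while the variable and function names are illustrative.

```python
# (tps/user, tps/gpu) per concurrency level, copied from the tables above.
reduce_on = {
    1: (420.17, 9072.30),
    2: (296.74, 12925.61),
    4: (181.16, 16898.29),
    8: (110.74, 20717.89),
    16: (71.84, 23722.54),
}
reduce_off = {
    1: (416.67, 9151.52),
    2: (289.86, 12945.82),
    4: (184.16, 17025.78),
    8: (113.38, 21179.63),
    16: (69.64, 24443.96),
}


def pct_delta(on: float, off: float) -> float:
    """Relative change of reduce ON versus reduce OFF, in percent."""
    return 100.0 * (on - off) / off


deltas = {
    conc: (
        pct_delta(reduce_on[conc][0], reduce_off[conc][0]),
        pct_delta(reduce_on[conc][1], reduce_off[conc][1]),
    )
    for conc in sorted(reduce_on)
}
for conc, (user_d, gpu_d) in deltas.items():
    print(f"conc={conc:>2}  tps/user {user_d:+.2f}%  tps/gpu {gpu_d:+.2f}%")
```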

LorrinWWW · 2026-05-28T03:09:17Z

The two failed cases seem to be a machine issue. Any remaining concerns? @yweng0828 @syuoni @zhyncs

LorrinWWW · 2026-05-28T13:39:29Z

I'm going to merge since there is no regression. We can patch anytime if needed.

rjzhb requested a review from a team as a code owner May 22, 2026 20:58

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

rjzhb force-pushed the feat/eagle-draft-single-token branch from cacabd6 to ac042cd Compare May 22, 2026 21:05

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py

rjzhb force-pushed the feat/eagle-draft-single-token branch from ac042cd to d410fa6 Compare May 22, 2026 21:21

chatgpt-codex-connector Bot reviewed May 22, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

LorrinWWW self-assigned this May 25, 2026

LorrinWWW assigned woodyji May 25, 2026

LorrinWWW reviewed May 25, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

LorrinWWW reviewed May 25, 2026

Comment thread python/tokenspeed/runtime/execution/context.py Outdated

perf(eagle3): skip dead-position compute in draft catch-up step

f3d467e

Signed-off-by: rjzhb <rjzhb222@163.com>

rjzhb force-pushed the feat/eagle-draft-single-token branch from d410fa6 to f3d467e Compare May 25, 2026 22:04

rjzhb added 3 commits May 25, 2026 22:51

fix(eagle3): slice attn_output on non-prewrite attn path

e911f18

Signed-off-by: rjzhb <rjzhb222@163.com>

fix(eagle3): restrict draft-reduce to llama eagle3 head class

a0df4e3

Signed-off-by: rjzhb <rjzhb222@163.com>

chore(eagle3): trim comments on draft-reduce path

fae7291

Signed-off-by: rjzhb <rjzhb222@163.com>

chatgpt-codex-connector Bot reviewed May 25, 2026

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py

chatgpt-codex-connector Bot reviewed May 25, 2026

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py Outdated

rjzhb added 3 commits May 26, 2026 00:52

fix(eagle3): trim draft first-step seq_lens to accept length

c642f45

Signed-off-by: rjzhb <rjzhb222@163.com>

fix(comm): scatter on bs / global_bs under draft first-step reduce

b64b835

Signed-off-by: rjzhb <rjzhb222@163.com>

Merge remote-tracking branch 'upstream/main' into feat/eagle-draft-si…

e26faab

…ngle-token

chatgpt-codex-connector Bot reviewed May 26, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

rjzhb and others added 4 commits May 26, 2026 01:35

fix(eagle3): gate draft seq_lens correction to prewrite backends

d151cfb

Signed-off-by: rjzhb <rjzhb222@163.com>

refactor(eagle3): duck-type draft first-step reduce capability

e5ae5fa

Signed-off-by: rjzhb <rjzhb222@163.com>

feat(eagle3): support draft first-step reduce for MLA head

e242d39

Signed-off-by: rjzhb <rjzhb222@163.com>

Merge branch 'main' into feat/eagle-draft-single-token

876db44

mesaleh mentioned this pull request May 26, 2026

fix(trtllm-mla): make spec-decode CUDA graph capture causal #258

Closed

mesaleh mentioned this pull request May 26, 2026

fix(trtllm-mla): make spec-decode CUDA graph capture causal #260

Open

rjzhb changed the title ~~[WIP] perf(eagle3): skip dead-position compute in draft catch-up step~~ perf(eagle3): skip dead-position compute in draft catch-up step May 26, 2026

syuoni reviewed May 27, 2026

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated

rjzhb added 5 commits May 27, 2026 05:16

fix(eagle3): guard llama Eagle3 reduce slice against idle DP ranks

9e3c15a

Signed-off-by: rjzhb <rjzhb222@163.com>

feat(eagle3): support draft first-step reduce for DeepSeek NextN

9973dbe

Signed-off-by: rjzhb <rjzhb222@163.com>

feat(eagle3): support draft first-step reduce for Qwen3.5 MTP

4a2a2b7

Signed-off-by: rjzhb <rjzhb222@163.com>

refactor(eagle3): drop per-model supports_draft_first_step_reduce flag

161a324

Signed-off-by: rjzhb <rjzhb222@163.com>

chore(eagle3): trim verbose comments on reduce path

e0f2e68

Signed-off-by: rjzhb <rjzhb222@163.com>

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread python/tokenspeed/runtime/distributed/comm_manager.py

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py

chore(eagle3): replace EAGLE-only wording on draft_first_step_reduce …

4500a4f

…comments Signed-off-by: rjzhb <rjzhb222@163.com>

rjzhb changed the title ~~perf(eagle3): skip dead-position compute in draft catch-up step(decode)~~ perf(Spec Decode): skip dead-position compute in draft catch-up step(decode) May 27, 2026

fix(eagle3): align MoE collectives and final-norm decision with draft…

db82f61

… reduce Signed-off-by: rjzhb <rjzhb222@163.com>

chatgpt-codex-connector Bot reviewed May 27, 2026

Comment thread python/tokenspeed/runtime/models/deepseek_v3.py

rjzhb closed this May 27, 2026

fix(deepseek-nextn): delegate final norm to comm_manager for fused al…

8905345

…lreduce Signed-off-by: rjzhb <rjzhb222@163.com>

rjzhb reopened this May 27, 2026

Merge branch 'main' into feat/eagle-draft-single-token

02a9091

zhyncs requested review from syuoni and yweng0828 May 28, 2026 03:10

LorrinWWW merged commit a9bc218 into lightseekorg:main May 28, 2026
102 of 118 checks passed

rjzhb mentioned this pull request May 28, 2026

[WIP] refactor + perf(spec-decode): refactor #217 + add prefill scope #304

Open

syuoni mentioned this pull request May 29, 2026

perf: optimize logits allgather and parallelize eagle3 input projection #295

Merged
