
perf(Spec Decode): skip dead-position compute in draft catch-up step(decode) #217

Merged
LorrinWWW merged 22 commits into lightseekorg:main from rjzhb:feat/eagle-draft-single-token
May 28, 2026

Conversation

@rjzhb
Contributor

@rjzhb rjzhb commented May 22, 2026

Summary

The spec-decode draft head's catch-up (first) step takes a padded bs*spec_num_tokens input, but only one logit per request is sampled; the other bs*(spec_num_tokens-1) rows exist purely to write K/V into the KV cache for the next spec round. With lookahead=3 (4 tokens per request), 75% of the input rows have their attn / O-proj / MLP / norm outputs computed and then immediately discarded at the LogitsProcessor's last-position slice.

This PR moves that slice inside the layer, right after the KV write: instead of doing dead work on [bs*spec_num_tokens, H] and pruning at the end, we prune to [bs, H] between the KV write and the post-attn ops, then run o_proj / MLP / norms on the bs live rows only. For the LLaMa Eagle3 head with a prewrite-capable attention backend, the slice also fires before attention, so the decode kernel sees q_len_per_req=1 against the full cache, saving the attention compute too.
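
The reordering described above can be pictured with a toy layer in plain Python. This is only a sketch under stated assumptions: `toy_draft_layer`, `post_attn_ops`, `gather_ids`, and `kv_cache` are hypothetical names for illustration (not tokenspeed APIs), attention itself is omitted, and the live row is assumed to be the last padded slot of each request.

```python
from typing import List, Tuple

Row = List[float]


def post_attn_ops(rows: List[Row]) -> Tuple[List[Row], int]:
    """Stand-in for o_proj / MLP / norms: row-wise work plus a row counter."""
    return [[2.0 * x for x in r] for r in rows], len(rows)


def toy_draft_layer(
    hidden: List[Row], gather_ids: List[int], kv_cache: list, reduce: bool
) -> Tuple[List[Row], int]:
    # 1) The KV write always covers every padded row, because the next
    #    spec round reads those K/V entries (toy K and V are just copies).
    for row in hidden:
        kv_cache.append((list(row), list(row)))
    # 2) Optional prune right after the KV write: keep only the live rows.
    if reduce:
        hidden = [hidden[i] for i in gather_ids]
    # 3) Post-attention work runs on whatever rows remain.
    out, rows_done = post_attn_ops(hidden)
    if not reduce:
        # Baseline behaviour: the slice happens at the very end instead.
        out = [out[i] for i in gather_ids]
    return out, rows_done


bs, spec_num_tokens = 2, 4
hidden = [[float(i)] for i in range(bs * spec_num_tokens)]
# Assumption: the sampled row is the last slot of each request.
gather_ids = [r * spec_num_tokens + spec_num_tokens - 1 for r in range(bs)]

cache_a: list = []
cache_b: list = []
out_base, rows_base = toy_draft_layer(hidden, gather_ids, cache_a, reduce=False)
out_fast, rows_fast = toy_draft_layer(hidden, gather_ids, cache_b, reduce=True)
```

Both paths return the same rows and write the same cache entries; only the number of rows pushed through the post-attention stand-in changes. That equivalence relies on everything after attention being row-wise, which is why the slice can sit anywhere after the KV write.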

What's saved per layer (live rows / total rows = 1/spec_num_tokens):

  • QKV proj: still full (K/V must be written for the next round)
  • Attn (Q·K^T + softmax·V): bs queries instead of bs*spec_num_tokens, only on prewrite-capable backends (today: LLaMa Eagle3 head's LlamaAttention)
  • O-proj: bs rows
  • post-attn norm / residual / MLP / post-mlp norm / final norm: bs rows

Scope

Covers all spec-decode draft head classes that go through drafter/eagle.py:

  • LlamaForCausalLMEagle3 — MHA EAGLE3 (models/llama_eagle3.py); both pre-attn Q-slice (decode kernel switch) and post-attn slice
  • Eagle3DeepseekV2ForCausalLM — MLA EAGLE3 (models/deepseek_v3.py); post-attn slice in DeepseekV3AttentionMLA.forward + residual slice in Eagle3MlaDecoderLayer
  • DeepseekV3ForCausalLMNextN — DeepSeek-V3 MTP/NextN (models/deepseek_nextn.py); residual slice in shared DeepseekV3DecoderLayer + comm_manager.final_norm delegation for fused-allreduce contract
  • Qwen3_5ForConditionalGenerationNextN + Qwen3_5MoeForConditionalGenerationNextN — Qwen3.5 MTP (models/qwen3_5_nextn.py); attn-output + residual slice in shared Qwen3_5AttentionDecoderLayer

End-to-end sim

MiniMax-M2.5 + thoughtworks/MiniMax-M2.5-Eagle3 head, B200 TP=2, reasoning-style workload (8K prompt / 3K gen, QPS=0.3, 300s sustain):

| Metric | opt #1 | opt #2 | baseline |
|---|---|---|---|
| gen_tps (Loaded) | 32.0 | 34.9 | 30.2 |
| inflight_mean | 26.7 | 22.7 | 25.1 |
| mean_accept_len | 2.02 | 1.92 | 1.94 |

Metric notes (measured over the steady-state window after warm-up)

  • gen_tps: system-wide decode tokens per second. The throughput metric this optimization targets.
  • inflight_mean: average number of concurrent requests being processed — i.e. the effective batch size. The optimization's gain scales with this number.
  • mean_accept_len: average accepted tokens per spec round. Close values across runs confirm spec decode behavior is unchanged, so the comparison is apples-to-apples.

Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.
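
For reference, the relative gains can be recomputed directly from the gen_tps row of the table. This sketch assumes the three values map to opt #1, opt #2, and baseline in that order; `pct_gain` is a hypothetical helper.

```python
# gen_tps (Loaded) values copied from the table above.
gen_tps = {"opt #1": 32.0, "opt #2": 34.9, "baseline": 30.2}


def pct_gain(value: float, baseline: float) -> float:
    """Relative change versus baseline, in percent."""
    return 100.0 * (value / baseline - 1.0)


gains = {
    name: round(pct_gain(tps, gen_tps["baseline"]), 1)
    for name, tps in gen_tps.items()
    if name != "baseline"
}
print(gains)  # {'opt #1': 6.0, 'opt #2': 15.6}
```

Taken at face value, the table gives roughly +6% for opt #1 and roughly +16% for opt #2; the second figure is above the quoted upper bound, which the source does not explain. The runs also differ in inflight_mean, so the comparison is not at a fixed effective batch size.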

Test plan

  • End-to-end sim A/B on MiniMax-M2.5 + EAGLE3 (above)
  • Additional A/B accuracy + bench data on Kimi-K2.5-NVFP4 (MLA path) — see comments below

@rjzhb rjzhb requested a review from a team as a code owner May 22, 2026 20:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cacabd6a6f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
@rjzhb rjzhb force-pushed the feat/eagle-draft-single-token branch from cacabd6 to ac042cd Compare May 22, 2026 21:05
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: ac042cd43b

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py
@rjzhb rjzhb force-pushed the feat/eagle-draft-single-token branch from ac042cd to d410fa6 Compare May 22, 2026 21:21
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: d410fa6417

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
@LorrinWWW LorrinWWW self-assigned this May 25, 2026
@LorrinWWW
Contributor

This is actually @woodyji's idea, so looping him in as well. Another related idea is to skip N-1 prefill tokens in the last layer.

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
Comment thread python/tokenspeed/runtime/execution/context.py Outdated
@rjzhb rjzhb force-pushed the feat/eagle-draft-single-token branch from d410fa6 to f3d467e Compare May 25, 2026 22:04
rjzhb added 3 commits May 25, 2026 22:51
Signed-off-by: rjzhb <rjzhb222@163.com>
Signed-off-by: rjzhb <rjzhb222@163.com>
Signed-off-by: rjzhb <rjzhb222@163.com>
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: a0df4e37ed

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: fae72911fd

Comment thread python/tokenspeed/runtime/models/llama_eagle3.py Outdated
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: e26faabf55

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
@mesaleh

mesaleh commented May 26, 2026

I tested this EAGLE first-step reduce path against Kimi K2.6 NVFP4 + EAGLE3 + trtllm_mla on a local GB200 node. The direction here is right; after applying this branch, I still needed one Kimi/MLA-specific follow-up to move the ctx.gather_ids reduction before the MLA value projection and return the reduced buffer to o_proj.

That follow-up branch is here:

https://github.com/mesaleh/tokenspeed/tree/followup/pr217-kimi-mla-reduce

I did not open it as a direct PR to main because it is intentionally stacked on this PR and would otherwise duplicate/conflict with the EAGLE plumbing here.

I also opened two independent PRs from the same validation work that do not depend on or overlap with this branch:

Validation for the combined local fix set: Kimi K2.6 NVFP4 + EAGLE3 + trtllm_mla captured CUDA graph batch sizes [1, 2, 3, 4, 5, 6, 7, 8], reached healthy readiness, and completed a 10/10 OpenAI-compatible synthetic benchmark run.

@rjzhb rjzhb changed the title from "[WIP] perf(eagle3): skip dead-position compute in draft catch-up step" to "perf(eagle3): skip dead-position compute in draft catch-up step" May 26, 2026
@lightseek-bot
Contributor

@rjzhb could you run this: https://github.com/lightseekorg/tokenspeed/blob/main/test/agentic_benchmark/tokenspeed/agentic_bench.sh

Comment thread python/tokenspeed/runtime/execution/drafter/eagle.py Outdated
@borontion
Contributor

borontion commented May 27, 2026

@rjzhb watch the mi355 test; we can merge if it passes.

Feel free to ignore the timeout in ut-runtime-1gpu / linux-mi355-1gpu-lightseek.

@LorrinWWW
Contributor

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.

Are you sure, @rjzhb?

This may be a bit of an overclaim hh. The benefit likely only appears when the batch size is large enough, since the change mainly saves compute while the data-movement overhead stays the same.
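
One rough way to picture this point is a roofline-style estimate for a single linear layer, where time is the larger of compute time and memory time. This is a sketch under loud assumptions: the hardware numbers are made-up round figures (not B200 specifications), the layer shape is arbitrary, and `linear_time_s` and `reduce_speedup` are hypothetical helpers.

```python
def linear_time_s(rows: int, d_in: int, d_out: int,
                  peak_flops: float, peak_bw: float,
                  bytes_per_el: int = 2) -> float:
    """Roofline estimate: max(compute time, memory time) for rows x d_in @ d_in x d_out."""
    flops = 2.0 * rows * d_in * d_out
    # Weights are read once regardless of row count; activations scale with rows.
    bytes_moved = bytes_per_el * (d_in * d_out + rows * (d_in + d_out))
    return max(flops / peak_flops, bytes_moved / peak_bw)


def reduce_speedup(bs: int, spec_num_tokens: int, **hw) -> float:
    """Estimated speedup of one linear layer when pruning to bs live rows."""
    full = linear_time_s(bs * spec_num_tokens, **hw)
    pruned = linear_time_s(bs, **hw)
    return full / pruned


# Illustrative numbers only: 1e15 FLOP/s, 4e12 B/s, a 4096 x 4096 layer.
hw = dict(d_in=4096, d_out=4096, peak_flops=1e15, peak_bw=4e12)
small_batch = reduce_speedup(16, 4, **hw)
large_batch = reduce_speedup(1024, 4, **hw)
```

Under these assumed numbers, the small-batch case is bound by reading the weights, so pruning rows changes the estimate by only a few percent, while the large-batch case is compute-bound and approaches the full `spec_num_tokens` factor.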

@rjzhb
Contributor Author

rjzhb commented May 27, 2026

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.
>
> are you sure @rjzhb

Measuring this needs a decode-heavy workload with short prompts and large batches, since the change mainly targets decode. With a general workload, the difference from baseline is barely visible.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: e0f2e685c5

Comment thread python/tokenspeed/runtime/distributed/comm_manager.py
Comment thread python/tokenspeed/runtime/models/llama_eagle3.py
…comments

Signed-off-by: rjzhb <rjzhb222@163.com>
@rjzhb rjzhb changed the title from "perf(eagle3): skip dead-position compute in draft catch-up step(decode)" to "perf(Spec Decode): skip dead-position compute in draft catch-up step(decode)" May 27, 2026
… reduce

Signed-off-by: rjzhb <rjzhb222@163.com>
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Reviewed commit: db82f61222

Comment thread python/tokenspeed/runtime/models/deepseek_v3.py
@yweng0828
Contributor

> Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.

Thank you for your optimization. It's great work.

I'm interested in the improvement here. Could you please use this script to compare performance with and without this optimization? This optimization will definitely bring a performance improvement. It would be better if we had more performance data.

Thanks.

@rjzhb
Contributor Author

rjzhb commented May 27, 2026

> > Directional improvement ~+6% to +11% gen_tps vs baseline across both opt runs.
>
> Thank you for your optimization. It's great work.
>
> I'm interested in the improvement here. Could you please use this script to compare performance with and without this optimization? This optimization will definitely bring a performance improvement. It would be better if we had more performance data.
>
> Thanks.

@rjzhb rjzhb closed this May 27, 2026
…lreduce

Signed-off-by: rjzhb <rjzhb222@163.com>
@rjzhb rjzhb reopened this May 27, 2026
@rjzhb
Contributor Author

rjzhb commented May 27, 2026

Now testing agentic_bench.sh

@rjzhb
Contributor Author

rjzhb commented May 28, 2026

@lightseek-bot @yweng0828 I ran the Kimi-K2.5-NVFP4 agentic bench with the default settings.
At this scale, I don't see a clear measurable win from reduce ON. The ON/OFF numbers are mostly within normal run-to-run noise: latency and throughput move by around 1–3%, cache hit is basically unchanged, and decoded tok/iter is also very close.

My read is that the bench is too small to expose this optimization. With max-num-seqs=16, the catch-up step is at most 16 * 4 = 64 rows, and that step is only a small part of the full spec cycle. So the expected end-to-end gain is probably about the same size as the benchmark variance.
Also, this MLA path only has the post-attn slice optimization, not the Q-slice + decode-kernel switch that showed a bigger gain in the MiniMax MHA sim. So I think the current result is expected.

Also, from nsys: the index_select that the reduce path runs in each catch-up step accounts for ~0.3% of total kernel time.

## attn_tp4_moe_tp4 (B200)

### A (reduce ON)

| Conc | Latency (tps/user) | Throughput (tps/gpu) | Approx Cache Hit | Decoded Tok/Iter |
|---|---|---|---|---|
| 1  | 420.17 | 9072.30  | 90.93 | 3.3885 |
| 2  | 296.74 | 12925.61 | 91.08 | 3.3609 |
| 4  | 181.16 | 16898.29 | 90.24 | 3.1473 |
| 8  | 110.74 | 20717.89 | 90.09 | 3.3092 |
| 16 | 71.84  | 23722.54 | 90.56 | 3.3221 |

### B (reduce OFF)

| Conc | Latency (tps/user) | Throughput (tps/gpu) | Approx Cache Hit | Decoded Tok/Iter |
|---|---|---|---|---|
| 1  | 416.67 | 9151.52  | 90.93 | 3.3885 |
| 2  | 289.86 | 12945.82 | 91.08 | 3.2896 |
| 4  | 184.16 | 17025.78 | 90.24 | 3.1558 |
| 8  | 113.38 | 21179.63 | 90.10 | 3.3664 |
| 16 | 69.64  | 24443.96 | 90.56 | 3.3526 |
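
The ON/OFF gap can be checked by computing per-concurrency relative differences from the two tables. This is a minimal sketch that only reuses the tps/user and tps/gpu columns above; `pct_delta` is a hypothetical helper.

```python
conc = [1, 2, 4, 8, 16]

# Values copied from tables A (reduce ON) and B (reduce OFF).
on = {
    "tps_user": [420.17, 296.74, 181.16, 110.74, 71.84],
    "tps_gpu": [9072.30, 12925.61, 16898.29, 20717.89, 23722.54],
}
off = {
    "tps_user": [416.67, 289.86, 184.16, 113.38, 69.64],
    "tps_gpu": [9151.52, 12945.82, 17025.78, 21179.63, 24443.96],
}


def pct_delta(a: float, b: float) -> float:
    """Relative difference of a versus b, in percent."""
    return 100.0 * (a - b) / b


deltas = {
    metric: [round(pct_delta(a, b), 2) for a, b in zip(on[metric], off[metric])]
    for metric in on
}

for c, d_user, d_gpu in zip(conc, deltas["tps_user"], deltas["tps_gpu"]):
    print(f"conc={c:>2}  tps/user {d_user:+.2f}%  tps/gpu {d_gpu:+.2f}%")
```

On these numbers every delta stays within about 3.2% in magnitude. The tps/user differences change sign across concurrency levels, while the tps/gpu differences are slightly negative for reduce ON at every level, so a single run per configuration does not separate the effect from noise.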

@LorrinWWW
Contributor

The two failed cases seem to be a machine issue. Any remaining concerns? @yweng0828 @syuoni @zhyncs

@zhyncs zhyncs requested review from syuoni and yweng0828 May 28, 2026 03:10
@LorrinWWW
Contributor

I'm going to merge since there is no regression. We can patch anytime if needed.

@LorrinWWW LorrinWWW merged commit a9bc218 into lightseekorg:main May 28, 2026
102 of 118 checks passed

8 participants