perf: TokenSpeed MLA decode kernel optimization for num_heads=16 by dishengbin · Pull Request #255 · lightseekorg/tokenspeed

dishengbin · 2026-05-26T05:48:28Z

Summary

Optimize MLA decode kernel for num_heads=16.

batch_size	Before (us)	After (us)	Speedup
1	32.0000	28.9792	1.104x
2	34.0944	33.9440	1.004x
4	47.5648	45.3120	1.050x
8	81.1504	71.8384	1.130x
16	159.8576	126.4128	1.265x
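
The Speedup column is the ratio of the "before" latency to the "after" latency. A minimal sketch that recomputes it from the numbers in the table above; the variable and function names are illustrative and do not come from the PR:

```python
# Latencies in microseconds, copied from the table: batch_size -> (before, after).
timings_us = {
    1: (32.0000, 28.9792),
    2: (34.0944, 33.9440),
    4: (47.5648, 45.3120),
    8: (81.1504, 71.8384),
    16: (159.8576, 126.4128),
}


def speedup(before_us: float, after_us: float) -> float:
    """Speedup is the old latency divided by the new latency."""
    return before_us / after_us


# Round to three decimals to match the precision used in the table.
speedups = {bs: round(speedup(b, a), 3) for bs, (b, a) in timings_us.items()}

for bs, s in speedups.items():
    print(f"batch_size={bs:>2}  speedup={s:.3f}x")
```

The gain is smallest at batch_size=2 and largest at batch_size=16, which is the only trend these five data points support.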

Note: the perf gain was observed on B200, and the changes are restricted to B200.

Test Plan

This reverts commit a7e4d88.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56c54099c5


chatgpt-codex-connector · 2026-05-26T05:54:09Z

@@ -332,7 +411,7 @@ def __call__(
        softmax_scale: cutlass.Float32,
        output_scale: cutlass.Float32,
        stream: cuda.CUstream,
-        use_pdl: cutlass.Constexpr = False,
+        use_pdl: cutlass.Constexpr = True,


Keep PDL opt-in for FP8 decode entrypoint

Changing use_pdl to default True makes every direct compile of this kernel enable Programmatic Dependent Launch unless callers explicitly override it; in this file’s run() path, cute.compile(...) does not pass use_pdl, so benchmark/test runs now always require PDL-capable runtime support. On environments with older CUDA/driver stacks (or where PDL is intentionally disabled), this can fail kernel compile/launch and breaks previously working experiments. Please keep the default opt-in behavior (False) or plumb an explicit CLI/runtime toggle through this path.
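
One way to follow the second option in this suggestion is to expose an explicit toggle and forward it to the compile call. The sketch below is only an illustration under stated assumptions: the `--use-pdl` flag and the `build_compile_kwargs` helper are hypothetical names that do not appear in the PR, and the real `cute.compile(...)` call site is not reproduced here.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="MLA decode benchmark (illustrative)")
    # Hypothetical flag: PDL stays off unless the caller explicitly asks for it.
    parser.add_argument(
        "--use-pdl",
        action="store_true",
        default=False,
        help="enable Programmatic Dependent Launch",
    )
    return parser.parse_args(argv)


def build_compile_kwargs(args):
    # Hypothetical helper: pass the choice explicitly instead of relying on
    # the kernel's default value for use_pdl.
    return {"use_pdl": args.use_pdl}


default_kwargs = build_compile_kwargs(parse_args([]))
pdl_kwargs = build_compile_kwargs(parse_args(["--use-pdl"]))
print(default_kwargs, pdl_kwargs)
```

With a toggle like this, the kernel's own default matters less, because every benchmark or test run states its choice explicitly.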


lightseek-bot · 2026-05-27T08:28:50Z

> Notes: the perf gain is observed on B200, and the changes are only restricted to B200.

@dishengbin Can we run the E2E agentic benchmark locally on B200 and share the performance before and after this optimization?

dishengbin · 2026-05-28T05:33:44Z

> > Notes: the perf gain is observed on B200, and the changes are only restricted to B200.
>
> @dishengbin Can we run the E2E agentic benchmark locally on B200 and share the performance before and after this optimization?

Sure, will do.

syuoni · 2026-05-28T10:48:03Z

+
+    is_sm100 = compute_capability == (10, 0)
+
+    if num_heads == 16 and seq_len_q == 4 and is_sm100:


Why do we gate off seq_len_q == 1?
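
To make the question concrete, the quoted condition can be wrapped in a small predicate. This is a sketch only: `takes_num_heads16_path` is a hypothetical name, and the body mirrors the two quoted diff lines as they stood at review time, not necessarily the final code.

```python
def takes_num_heads16_path(num_heads: int, seq_len_q: int, compute_capability: tuple) -> bool:
    """Mirror of the quoted gate: all three conditions must hold."""
    is_sm100 = compute_capability == (10, 0)
    return num_heads == 16 and seq_len_q == 4 and is_sm100


# seq_len_q == 4 on SM100 takes the optimized path ...
print(takes_num_heads16_path(16, 4, (10, 0)))
# ... while seq_len_q == 1 on the same hardware does not, which is what the
# reviewer is asking about.
print(takes_num_heads16_path(16, 1, (10, 0)))
```

As written, the gate admits only seq_len_q == 4, so any other query length, including 1, falls through to the existing path.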

adi added 6 commits May 21, 2026 21:22

TokenSpeed MLA decode kernel optimization

9771e8a

add function to get the correct mma_tile for num_heads=16

c0a4dbd

reformat

f251395

for debug: keep split-kv the same with 2cta version

a7e4d88

fix LSE issue

912e381

Revert "for debug: keep split-kv the same with 2cta version"

56c5409

This reverts commit a7e4d88.

dishengbin requested a review from a team as a code owner May 26, 2026 05:48

Merge branch 'main' into adi/mla-decode-optimization

ef302eb

chatgpt-codex-connector Bot reviewed May 26, 2026


adi added 3 commits May 26, 2026 01:28

fix kv-split calculation

636e648

reformat

b68220a

restrict the changes to SM100

97a5e73

syuoni reviewed May 28, 2026


syuoni marked this pull request as draft May 28, 2026 14:07

change the condition for M64 path

097b959

dishengbin wants to merge 11 commits into main from adi/mla-decode-optimization

3 participants

