
perf: TokenSpeed MLA decode kernel optimization for num_heads=16 #255

Draft
dishengbin wants to merge 11 commits into main from adi/mla-decode-optimization

Contributor

@dishengbin dishengbin commented May 26, 2026

Summary

Optimize MLA decode kernel for num_heads=16.

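| batch_size | Before (us) | After (us) | Speedup |
|-----------:|------------:|-----------:|--------:|
| 1 | 32.0000 | 28.9792 | 1.104x |
| 2 | 34.0944 | 33.9440 | 1.004x |
| 4 | 47.5648 | 45.3120 | 1.050x |
| 8 | 81.1504 | 71.8384 | 1.130x |
| 16 | 159.8576 | 126.4128 | 1.265x |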
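
The Speedup column matches the ratio of the Before latency to the After latency. A quick sketch that recomputes it from the numbers in the table above (the names `rows`, `speedup`, and `speedups` are illustrative and not part of the PR):

```python
# Latencies in microseconds, copied from the table above:
# (batch_size, before_us, after_us)
rows = [
    (1, 32.0000, 28.9792),
    (2, 34.0944, 33.9440),
    (4, 47.5648, 45.3120),
    (8, 81.1504, 71.8384),
    (16, 159.8576, 126.4128),
]


def speedup(before_us: float, after_us: float) -> float:
    """Speedup is the old latency divided by the new latency."""
    return before_us / after_us


# Round to three decimals, the precision used in the table.
speedups = {bs: round(speedup(before, after), 3) for bs, before, after in rows}

for bs, s in speedups.items():
    print(f"batch_size={bs}: {s:.3f}x")
```

The gain is smallest at batch_size 2 and largest at batch_size 16.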

Notes: the perf gain is observed on B200, and the changes are only restricted to B200.

Test Plan

@dishengbin dishengbin requested a review from a team as a code owner May 26, 2026 05:48

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56c54099c5

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@@ -332,7 +411,7 @@ def __call__(
softmax_scale: cutlass.Float32,
output_scale: cutlass.Float32,
stream: cuda.CUstream,
-use_pdl: cutlass.Constexpr = False,
+use_pdl: cutlass.Constexpr = True,

P2: Keep PDL opt-in for FP8 decode entrypoint

Changing `use_pdl` to default `True` makes every direct compile of this kernel enable Programmatic Dependent Launch unless callers explicitly override it; in this file’s `run()` path, `cute.compile(...)` does not pass `use_pdl`, so benchmark/test runs now always require PDL-capable runtime support. On environments with older CUDA/driver stacks (or where PDL is intentionally disabled), this can fail kernel compile/launch and breaks previously working experiments. Please keep the default opt-in behavior (`False`) or plumb an explicit CLI/runtime toggle through this path.
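
One way the suggested toggle could look, as a minimal sketch only. The `--use-pdl` flag and the helpers `build_parser` and `compile_kwargs` are hypothetical and do not come from this PR; the sketch assumes the resulting keyword would be forwarded to the existing compile call in `run()`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Benchmark CLI sketch with PDL as an explicit opt-in."""
    parser = argparse.ArgumentParser(description="MLA decode benchmark (sketch)")
    # Hypothetical flag: PDL stays off unless the caller asks for it.
    parser.add_argument(
        "--use-pdl",
        action="store_true",
        default=False,
        help="Enable Programmatic Dependent Launch (needs runtime support).",
    )
    return parser


def compile_kwargs(args: argparse.Namespace) -> dict:
    """Collect the keyword that would be passed on to the compile call.

    Assumption: the compile path accepts use_pdl as a keyword argument.
    """
    return {"use_pdl": args.use_pdl}


default_kwargs = compile_kwargs(build_parser().parse_args([]))
pdl_kwargs = compile_kwargs(build_parser().parse_args(["--use-pdl"]))
print(default_kwargs, pdl_kwargs)
```

With this shape the kernel signature can keep `use_pdl = False`, and B200 benchmark runs opt in on the command line.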


@lightseek-bot
Contributor

> Notes: the perf gain is observed on B200, and the changes are only restricted to B200.

@dishengbin Can we run the E2E agentic benchmark locally on B200 and share the performance before and after this optimization?

@dishengbin
Contributor Author

> > Notes: the perf gain is observed on B200, and the changes are only restricted to B200.
>
> @dishengbin Can we run the E2E agentic benchmark locally on B200 and share the performance before and after this optimization?

Sure, will do.


is_sm100 = compute_capability == (10, 0)

if num_heads == 16 and seq_len_q == 4 and is_sm100:
Member


Why do we gate off seq_len_q == 1?
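
For reference, a standalone sketch of the quoted condition makes the admitted shapes explicit. The function name `takes_optimized_path` is hypothetical; only the boolean expression is taken from the diff.

```python
from typing import Tuple


def takes_optimized_path(
    num_heads: int,
    seq_len_q: int,
    compute_capability: Tuple[int, int],
) -> bool:
    """Mirror of the gate quoted above, wrapped in a hypothetical helper."""
    is_sm100 = compute_capability == (10, 0)
    return num_heads == 16 and seq_len_q == 4 and is_sm100


# seq_len_q == 4 on SM100 takes the new path ...
print(takes_optimized_path(16, 4, (10, 0)))
# ... while seq_len_q == 1 on the same device does not.
print(takes_optimized_path(16, 1, (10, 0)))
```

As written, only `seq_len_q == 4` with 16 heads on compute capability (10, 0) selects the new path; every other combination, including `seq_len_q == 1`, falls through to the existing one.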

@syuoni syuoni marked this pull request as draft May 28, 2026 14:07
