perf: TokenSpeed MLA decode kernel optimization for num_heads=16 #255
dishengbin wants to merge 11 commits into
This reverts commit a7e4d88.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56c54099c5
@@ -332,7 +411,7 @@ def __call__(
     softmax_scale: cutlass.Float32,
     output_scale: cutlass.Float32,
     stream: cuda.CUstream,
-    use_pdl: cutlass.Constexpr = False,
+    use_pdl: cutlass.Constexpr = True,
Keep PDL opt-in for FP8 decode entrypoint
Changing use_pdl to default True makes every direct compile of this kernel enable Programmatic Dependent Launch unless callers explicitly override it; in this file’s run() path, cute.compile(...) does not pass use_pdl, so benchmark/test runs now always require PDL-capable runtime support. On environments with older CUDA/driver stacks (or where PDL is intentionally disabled), this can fail kernel compile/launch and breaks previously working experiments. Please keep the default opt-in behavior (False) or plumb an explicit CLI/runtime toggle through this path.
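One way to plumb the toggle, as a minimal sketch only: the flag name `--use-pdl` and the helper `build_compile_kwargs` are hypothetical, and how `run()` would forward the value into `cute.compile(...)` is an assumption, not something taken from this PR.

```python
import argparse


def parse_args(argv=None):
    """Parse benchmark options; PDL stays off unless explicitly requested."""
    parser = argparse.ArgumentParser(description="MLA decode benchmark (sketch)")
    # Hypothetical flag name: opt in to Programmatic Dependent Launch.
    parser.add_argument(
        "--use-pdl",
        action="store_true",
        default=False,
        help="enable Programmatic Dependent Launch (requires runtime support)",
    )
    return parser.parse_args(argv)


def build_compile_kwargs(args):
    """Hypothetical helper: keyword arguments that run() would forward
    to its compile call, so the kernel default is never relied upon."""
    return {"use_pdl": args.use_pdl}
```

With this shape, a run without the flag yields `{"use_pdl": False}`, so existing benchmark and test invocations keep their previous behavior on stacks without PDL support.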
@dishengbin Can we run the E2E agentic benchmark locally on B200 and share the performance before and after this optimization?
Sure, will do.
is_sm100 = compute_capability == (10, 0)

if num_heads == 16 and seq_len_q == 4 and is_sm100:
Why do we gate off seq_len_q == 1?
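For reference, the gate in question can be restated as a standalone predicate. This is a sketch that mirrors the quoted condition; the function name `takes_heads16_path` is hypothetical and does not appear in the PR.

```python
def takes_heads16_path(num_heads, seq_len_q, compute_capability):
    """Return True when the quoted condition selects the num_heads=16 path."""
    # Same comparison as the quoted diff: compute capability (10, 0) only.
    is_sm100 = compute_capability == (10, 0)
    return num_heads == 16 and seq_len_q == 4 and is_sm100


# seq_len_q == 1 does not satisfy the condition as written.
single_token = takes_heads16_path(16, 1, (10, 0))
four_tokens = takes_heads16_path(16, 4, (10, 0))
```

As written, `single_token` is `False` and `four_tokens` is `True`, which is the behavior the question above is asking about.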
Summary
Optimize the MLA decode kernel for num_heads=16.
Note: the performance gain was observed on B200, and the changes are restricted to B200.
Test Plan