fix(cute_dsl): skip CuTe DSL argmax kernel on H20 GPUs #311
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c36de06987
    # Only Hopper (sm_90) SKUs are affected -- Blackwell always supports
    # cluster launches.
    if p.arch_version.major \!= 9:
Fix the escaped inequality that prevents import
This backslash makes the file invalid Python (\!= is parsed as a line continuation followed by !), so importing tokenspeed_kernel.ops.sampling.cute_dsl now raises a SyntaxError before any fallback path can run, including on CPU/unsupported GPUs. I verified this with python -m py_compile tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/cute_dsl.py; changing it to != restores valid syntax.
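The failure can be reproduced without touching the repository. The following is a minimal sketch, assuming only the one-line condition quoted above; the `compiles` helper and the variable name `major` are illustrative and not taken from the PR.

```python
def compiles(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        # compile() only parses; it never executes the snippet.
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return False
    return True


# A raw string keeps the backslash exactly as it would appear in the file.
broken = r"if major \!= 9:" + "\n    pass\n"
fixed = "if major != 9:\n    pass\n"

print(compiles(broken))  # False
print(compiles(fixed))   # True
```

Because the error is raised at parse time, it affects every importer of the module regardless of which GPU (if any) is present.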
c36de06 to 1c0077e
NVIDIA H20 GPUs report sm_90 (Hopper) but lack hardware TMA cluster launch support. The CUTLASS DSL ArgmaxKernel uses cluster_dims > 1, which triggers CUDA_ERROR_INVALID_CLUSTER_SIZE (error 912) on H20 when the vocab dimension is large enough to require multi-cluster tiling.

Add `_has_cluster_launch_support()` that detects H20 via device name and routes it through the torch.argmax fallback. Other Hopper SKUs (H100, H200, H800) are unaffected.

Fixes lightseekorg#310

Signed-off-by: 葛平 <gepin.zs@antgroup.com>
1c0077e to 7e8541b
The previous commit introduced a backslash-escaped `\!=` operator due to shell escaping during file generation. Python rejects this as a syntax error. Replace it with the correct `!=` operator.

Addresses Codex review P1 finding on commit c36de06.

Signed-off-by: botieking <botieking98@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: 葛平 <gepin.zs@antgroup.com>
Bug-introducing commit verified
Bisect verification on H20-3e (saga pod) confirms:
Root cause in PR #262 — two changes combined:
Before #262, logits were always fp32 when reaching the CuTe DSL path, and the fp32 guard in
Verification method: launched
Summary
- Add `_has_cluster_launch_support()` to detect NVIDIA H20 GPUs (sm_90 without TMA cluster launch hardware) and skip the CuTe DSL ArgmaxKernel on them
- H20 falls back to `torch.argmax`; no user-facing change needed

Motivation
The CUTLASS DSL `ArgmaxKernel` uses `cluster_dims > (1,1,1)` via TMA. NVIDIA H20 GPUs report sm_90 (Hopper) but lack the hardware cluster launch capability, causing `CUDA_ERROR_INVALID_CLUSTER_SIZE` (error 912) at kernel launch time when the vocab dimension is large (e.g. Qwen3's N=151936).

Confirmed on H20-3e with `tokenspeed_kernel==0.1.0`:

Approach
Device-name heuristic: H20 is the only known sm_90 SKU without cluster launch support. Other Hopper SKUs (H100, H200, H800) are unaffected.
Fixes #310
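
For readers who want to see the shape of the device-name heuristic described under Approach, here is a minimal sketch. It is not the code from this PR: the function takes the device name and compute-capability major version as plain arguments (hypothetical parameters chosen so the logic can be tested without a GPU), and the whole-token regex is an assumption made to keep "H200" from matching "H20".

```python
import re

# Assumption: match "H20" only as a whole token, so "NVIDIA H200" is not caught
# by a naive substring check.
_H20_PATTERN = re.compile(r"\bH20\b")


def has_cluster_launch_support(device_name: str, cc_major: int) -> bool:
    """Illustrative check for whether cluster launches can be used on a device."""
    if cc_major < 9:
        # Thread block clusters were introduced with compute capability 9.0.
        return False
    if cc_major != 9:
        # Per the PR, only Hopper SKUs are affected; Blackwell is treated
        # as always supporting cluster launches.
        return True
    # sm_90: H20 is the only known SKU without cluster launch support.
    return _H20_PATTERN.search(device_name) is None


# Device-name strings below are illustrative examples.
print(has_cluster_launch_support("NVIDIA H20", 9))
print(has_cluster_launch_support("NVIDIA H200", 9))
```

In a PyTorch setting the two arguments could plausibly come from `torch.cuda.get_device_name()` and `torch.cuda.get_device_capability()`, with the caller routing to `torch.argmax` whenever the check returns `False`.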