fix(cute_dsl): skip CuTe DSL argmax kernel on H20 GPUs by botieking98 · Pull Request #311 · lightseekorg/tokenspeed

botieking98 · 2026-05-29T13:08:05Z

Summary

- Add `_has_cluster_launch_support()` to detect NVIDIA H20 GPUs (sm_90 without TMA cluster launch hardware) and skip the CuTe DSL ArgmaxKernel on them
- H20 falls back transparently to `torch.argmax`, so no user-facing change is needed

Motivation

The CUTLASS DSL ArgmaxKernel uses cluster_dims > (1,1,1) via TMA. NVIDIA H20 GPUs report sm_90 (Hopper) but lack the hardware cluster launch capability, causing CUDA_ERROR_INVALID_CLUSTER_SIZE (error 912) at kernel launch time when the vocab dimension is large (e.g. Qwen3's N=151936).

Confirmed on H20-3e with tokenspeed_kernel==0.1.0:

- N=151936 (real Qwen3 vocab): crashes for all M values (1-256)
- N=32768: works fine (doesn't trigger multi-cluster tiling)

Approach

Device-name heuristic: H20 is the only known sm_90 SKU without cluster launch support. Other Hopper SKUs (H100, H200, H800) are unaffected.
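A minimal sketch of what such a device-name heuristic could look like, not the PR's actual implementation. The function names, the regular expression, and the backend labels are assumptions made for illustration; in practice the device name and compute capability would come from the runtime (for example `torch.cuda.get_device_name()` and `torch.cuda.get_device_capability()`), but they are passed in as plain arguments here so the logic can be checked without a GPU. The pattern matches `H20` as a whole token so that `H200` is not caught by accident.

```python
import re

# Hypothetical pattern: "H20" as a whole token, so "H200" does not match.
_H20_PATTERN = re.compile(r"\bH20\b", re.IGNORECASE)


def has_cluster_launch_support(device_name: str, arch_major: int) -> bool:
    """Return False only for sm_90 devices whose name identifies them as H20."""
    if arch_major != 9:
        # Following the PR's assumption that only Hopper (sm_90) SKUs are
        # affected; other architectures are assumed to be gated elsewhere.
        return True
    return _H20_PATTERN.search(device_name) is None


def pick_argmax_backend(device_name: str, arch_major: int) -> str:
    """Choose a backend label; the labels are illustrative only."""
    if has_cluster_launch_support(device_name, arch_major):
        return "cute_dsl"
    return "torch.argmax"


for name in ("NVIDIA H20", "NVIDIA H20-3e", "NVIDIA H200", "NVIDIA H100 80GB HBM3"):
    print(name, "->", pick_argmax_backend(name, 9))
```

A plain substring test such as `"H20" in name` would also match `H200`, which is why the sketch uses word boundaries.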

Fixes #310

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c36de06987


chatgpt-codex-connector · 2026-05-29T13:09:50Z

+
+    # Only Hopper (sm_90) SKUs are affected -- Blackwell always supports
+    # cluster launches.
+    if p.arch_version.major \!= 9:


Fix the escaped inequality that prevents import

This backslash makes the file invalid Python (\!= is parsed as a line continuation followed by !), so importing tokenspeed_kernel.ops.sampling.cute_dsl now raises a SyntaxError before any fallback path can run, including on CPU/unsupported GPUs. I verified this with python -m py_compile tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/cute_dsl.py; changing it to != restores valid syntax.
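The failure described above can be reproduced without the repository. The sketch below is an illustration only: it feeds two small source strings (made up for this example, not taken from `cute_dsl.py`) to Python's built-in `compile()` and reports whether each one parses.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as a Python module, False on SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return False
    return True


# "\\!=" in this string literal is the two characters "\!" followed by "=",
# i.e. the same text the reviewed diff contains.
broken = "major = 9\nif major \\!= 9:\n    pass\n"
fixed = "major = 9\nif major != 9:\n    pass\n"

print(compiles(broken))  # False
print(compiles(fixed))   # True
```

Because the error is raised while parsing, it occurs at import time regardless of which GPU, if any, is present.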


NVIDIA H20 GPUs report sm_90 (Hopper) but lack hardware TMA cluster launch support. The CUTLASS DSL ArgmaxKernel uses cluster_dims > 1, which triggers CUDA_ERROR_INVALID_CLUSTER_SIZE (error 912) on H20 when the vocab dimension is large enough to require multi-cluster tiling.

Add `_has_cluster_launch_support()` that detects H20 via device name and routes it through the torch.argmax fallback. Other Hopper SKUs (H100, H200, H800) are unaffected.

Fixes lightseekorg#310

Signed-off-by: 葛平 <gepin.zs@antgroup.com>

The previous commit introduced a backslash-escaped `\!=` operator due to shell escaping during file generation. Python rejects this as a syntax error. Replace it with the correct `!=` operator.

Addresses Codex review P1 finding on commit c36de06.

Signed-off-by: botieking <botieking98@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: 葛平 <gepin.zs@antgroup.com>

botieking98 · 2026-05-30T15:42:24Z

Bug-introducing commit verified

Bisect verification on H20-3e (saga pod) confirms:

| Commit | Description | Result |
| --- | --- | --- |
| `33ec617` (`9f82b3f`~1) | fix(kvstore): decouple write and execute stream (#279) | OK: server starts, inference succeeds |
| `9f82b3ff3911` | perf(sampling): fuse logits fp32 cast to argmax or softmax (#262) | CRASH: `CUDA_ERROR_INVALID_CLUSTER_SIZE` (error 912) on all TP ranks |

Root cause in PR #262 — two changes combined:

- `logits_processor.py`: changed `.float()` to `.contiguous()`, so logits stay bf16 instead of being cast to fp32
- `cute_dsl.py`: removed the `if dtype is not torch.float32: return False` guard in `_supports_cute()`

Before #262, logits were always fp32 when reaching the CuTe DSL path, and the fp32 guard in _supports_cute served as a second safety net. After #262, bf16 logits go directly into ArgmaxKernel → bf16 + vocab 151936 → cluster_n = 16 → exceeds H20's MAX_CLUSTER_SIZE = 1 → crash.
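The failing condition in the last step of that chain can be written as a one-line comparison. This is only an illustration of the reasoning, using the two numbers quoted in this comment (`cluster_n = 16` and a maximum cluster size of 1 on H20); the function name is hypothetical, and how the kernel derives `cluster_n` from dtype and vocab size is not modelled here.

```python
def cluster_launch_fits(requested_cluster_n: int, max_cluster_size: int) -> bool:
    """Return True if the requested cluster size is within the device limit."""
    return requested_cluster_n <= max_cluster_size


# Values quoted in the comment above: the bf16 path requests 16, H20 allows 1.
print(cluster_launch_fits(16, 1))  # False
```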

Verification method: launched tokenspeed serve with Qwen3.6-27B (vocab_size=151936, tp=8)

botieking98 requested a review from a team as a code owner May 29, 2026 13:08

chatgpt-codex-connector Bot reviewed May 29, 2026


botieking98 force-pushed the fix/h20-cute-dsl-cluster-launch branch from c36de06 to 1c0077e Compare May 29, 2026 13:16

botieking98 force-pushed the fix/h20-cute-dsl-cluster-launch branch from 1c0077e to 7e8541b Compare May 29, 2026 13:18

botieking98 mentioned this pull request May 30, 2026

CuTe DSL ArgmaxKernel crashes with CUDA_ERROR_INVALID_CLUSTER_SIZE on NVIDIA H20 #310

Open

botieking98 wants to merge 2 commits into lightseekorg:main from botieking98:fix/h20-cute-dsl-cluster-launch
