fix(cute_dsl): skip CuTe DSL argmax kernel on H20 GPUs#311

Open
botieking98 wants to merge 2 commits into
lightseekorg:mainfrom
botieking98:fix/h20-cute-dsl-cluster-launch
@botieking98 botieking98 commented May 29, 2026

Summary

  • Add _has_cluster_launch_support() to detect NVIDIA H20 GPUs (sm_90 without TMA cluster launch hardware) and skip the CuTe DSL ArgmaxKernel on them
  • H20 falls back transparently to torch.argmax — no user-facing change needed

Motivation

The CUTLASS DSL ArgmaxKernel uses cluster_dims > (1,1,1) via TMA. NVIDIA H20 GPUs report sm_90 (Hopper) but lack the hardware cluster launch capability, causing CUDA_ERROR_INVALID_CLUSTER_SIZE (error 912) at kernel launch time when the vocab dimension is large (e.g. Qwen3's N=151936).

Confirmed on H20-3e with tokenspeed_kernel==0.1.0:

  • N=151936 (real Qwen3 vocab): crashes for all M values (1-256)
  • N=32768: works fine (doesn't trigger multi-cluster tiling)

Approach

Device-name heuristic: H20 is the only known sm_90 SKU without cluster launch support. Other Hopper SKUs (H100, H200, H800) are unaffected.
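A minimal sketch of what such a device-name heuristic could look like. The PR's actual `_has_cluster_launch_support()` body is not shown in this conversation, so the function name, its parameters, and the example device-name strings below are assumptions for illustration only. One detail worth guarding against in any name-based check: `"H20"` is a substring of `"H200"`, so a plain substring test would misclassify H200.

```python
import re

# Match "H20" as a whole token so that "H200" is not caught by accident.
_H20_PATTERN = re.compile(r"\bH20\b", re.IGNORECASE)


def has_cluster_launch_support(major: int, device_name: str) -> bool:
    """Hypothetical stand-in for the PR's _has_cluster_launch_support().

    `major` is the compute-capability major version and `device_name` is the
    name string reported by the driver; both are assumed inputs.
    """
    # Mirrors the guard quoted in the review: only Hopper (major == 9) is
    # inspected. Returning True for other architectures is an assumption.
    if major != 9:
        return True
    # Device-name heuristic: H20 is treated as the only sm_90 SKU without
    # cluster launch support.
    return _H20_PATTERN.search(device_name) is None
```

With PyTorch, the two inputs could plausibly come from `torch.cuda.get_device_capability()` and `torch.cuda.get_device_name()`; whether the PR uses those calls is not confirmed here.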

Fixes #310

@botieking98 botieking98 requested a review from a team as a code owner May 29, 2026 13:08

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c36de06987

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


# Only Hopper (sm_90) SKUs are affected -- Blackwell always supports
# cluster launches.
if p.arch_version.major \!= 9:

P1: Fix the escaped inequality that prevents import

This backslash makes the file invalid Python (\!= is parsed as a line continuation followed by !), so importing tokenspeed_kernel.ops.sampling.cute_dsl now raises a SyntaxError before any fallback path can run, including on CPU/unsupported GPUs. I verified this with python -m py_compile tokenspeed-kernel/python/tokenspeed_kernel/ops/sampling/cute_dsl.py; changing it to != restores valid syntax.
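The failure can be reproduced without the repository by compiling a two-line snippet with Python's built-in `compile()`. This is only an illustrative sketch: the `compiles` helper and the snippet text are made up for the example, and the source strings stand in for the affected line in `cute_dsl.py`.

```python
def compiles(source: str) -> bool:
    """Return True if `source` parses as valid Python, False on SyntaxError."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# "\\!=" in a Python string literal is the three characters \!= in the source
# being compiled, i.e. the same escaped operator flagged in the review.
broken = "major = 9\nif major \\!= 9:\n    pass\n"
fixed = "major = 9\nif major != 9:\n    pass\n"

print(compiles(broken))  # False: backslash is read as a line continuation
print(compiles(fixed))   # True
```

Because the error is raised at parse time, it surfaces on import regardless of which GPU (if any) is present.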


@botieking98 botieking98 force-pushed the fix/h20-cute-dsl-cluster-launch branch from c36de06 to 1c0077e Compare May 29, 2026 13:16
NVIDIA H20 GPUs report sm_90 (Hopper) but lack hardware TMA cluster
launch support. The CUTLASS DSL ArgmaxKernel uses cluster_dims > 1,
which triggers CUDA_ERROR_INVALID_CLUSTER_SIZE (error 912) on H20 when
the vocab dimension is large enough to require multi-cluster tiling.

Add `_has_cluster_launch_support()` that detects H20 via device name
and routes it through the torch.argmax fallback. Other Hopper SKUs
(H100, H200, H800) are unaffected.

Fixes lightseekorg#310

Signed-off-by: 葛平 <gepin.zs@antgroup.com>
@botieking98 botieking98 force-pushed the fix/h20-cute-dsl-cluster-launch branch from 1c0077e to 7e8541b Compare May 29, 2026 13:18
The previous commit introduced a backslash-escaped `\!=` operator due to
shell escaping during file generation. Python rejects this as a syntax
error. Replace it with the correct `!=` operator.

Addresses Codex review P1 finding on commit c36de06.

Signed-off-by: botieking <botieking98@gmail.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: 葛平 <gepin.zs@antgroup.com>

botieking98 commented May 30, 2026

Bug-introducing commit verified

Bisect verification on H20-3e (saga pod) confirms:

| Commit | Description | Result |
| --- | --- | --- |
| 33ec617 (9f82b3f~1) | fix(kvstore): decouple write and execute stream (#279) | OK: server starts, inference succeeds |
| 9f82b3ff3911 | perf(sampling): fuse logits fp32 cast to argmax or softmax (#262) | CRASH: CUDA_ERROR_INVALID_CLUSTER_SIZE (error 912) on all TP ranks |

Root cause in PR #262 — two changes combined:

  1. logits_processor.py: changed the .float().contiguous() call, so logits stay bf16 instead of being cast to fp32
  2. cute_dsl.py: removed the if dtype is not torch.float32: return False guard in _supports_cute()

Before #262, logits were always fp32 when reaching the CuTe DSL path, and the fp32 guard in _supports_cute served as a second safety net. After #262, bf16 logits go directly into ArgmaxKernel → bf16 + vocab 151936 → cluster_n = 16 → exceeds H20's MAX_CLUSTER_SIZE = 1 → crash.

Verification method: launched tokenspeed serve with Qwen3.6-27B (vocab_size=151936, tp=8)
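The failure chain above reduces to a comparison between the cluster size the kernel requests and the maximum the device accepts. The sketch below only illustrates that reasoning; it is not the approach taken in this PR (which uses a device-name heuristic), and `select_argmax_backend` and its parameters are hypothetical names. The numbers in the usage lines are the ones quoted in this comment.

```python
def select_argmax_backend(cluster_n: int, max_cluster_size: int) -> str:
    """Pick a backend label from the requested vs. supported cluster size.

    Hypothetical helper: `cluster_n` is the cluster size the kernel would
    request, `max_cluster_size` is the largest size the device can launch.
    """
    if cluster_n <= max_cluster_size:
        return "cute_dsl"
    # Requested cluster is too large for this device: use the fallback path.
    return "torch.argmax"


# bf16 logits with vocab 151936 on H20: cluster_n = 16, MAX_CLUSTER_SIZE = 1.
print(select_argmax_backend(16, 1))  # torch.argmax
# Assumed case with no multi-cluster tiling (cluster_n = 1).
print(select_argmax_backend(1, 1))   # cute_dsl
```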

Development

Successfully merging this pull request may close these issues.

CuTe DSL ArgmaxKernel crashes with CUDA_ERROR_INVALID_CLUSTER_SIZE on NVIDIA H20
