Conversation

@zhongbozhu (Collaborator) commented Jan 6, 2026

Description

Fixes #2558.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Move the 4-byte tile scheduler workspace allocation out of the fused NVFP4 cast hot path (no more per-call cudaMallocAsync/cudaFreeAsync)
  • Add a quant_workspace argument to nvte_group_hadamard_transform_cast_fusion so the caller provides the buffer; the PyTorch extension now allocates it once per call

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
@zhongbozhu self-assigned this Jan 6, 2026
greptile-apps bot (Contributor) commented Jan 6, 2026

Greptile Summary

Fixes a critical NVFP4 performance regression by eliminating repeated cudaMallocAsync/cudaFreeAsync calls in the hot path. The launch function group_row_col_rht_gemm_ntt_w_sfc was allocating and freeing a 4-byte tile scheduler workspace on every invocation, causing severe performance degradation (from roughly 570 TFLOPS down to 240-400 TFLOPS).

Changes:

  • Modified group_row_col_rht_gemm_ntt_w_sfc to accept pre-allocated workspace pointer instead of managing allocation internally
  • Updated API signature to pass workspace buffer from caller (nvte_group_hadamard_transform_cast_fusion)
  • PyTorch extension now allocates workspace once per call using at::empty and passes it through the call chain
  • Workspace is reused with cudaMemsetAsync instead of repeated malloc/free operations

This aligns with CUDA performance best practices: keep allocation calls out of frequently invoked launch paths.
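The core idea can be sketched in a few lines (the function names here are hypothetical stand-ins; the real changes live in the files listed below). The per-call stream-ordered allocation is replaced by a caller-owned buffer that is simply reset before each launch:

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// BEFORE (sketch): every call pays for an async alloc/free pair,
// even though the tile-scheduler counter is only 4 bytes.
void launch_fused_cast_old(cudaStream_t stream) {
  void* tile_counter = nullptr;
  cudaMallocAsync(&tile_counter, sizeof(uint32_t), stream);    // hot-path allocation
  cudaMemsetAsync(tile_counter, 0, sizeof(uint32_t), stream);
  // ... launch the grouped RHT + cast kernel, passing tile_counter ...
  cudaFreeAsync(tile_counter, stream);                         // hot-path free
}

// AFTER (sketch): the caller owns the workspace; the launcher only clears it.
void launch_fused_cast_new(void* tile_counter, cudaStream_t stream) {
  cudaMemsetAsync(tile_counter, 0, sizeof(uint32_t), stream);  // cheap reuse
  // ... launch the grouped RHT + cast kernel, passing tile_counter ...
}
```

The 4-byte buffer is presumably an atomic counter for a persistent tile scheduler, which is why it only needs to be zeroed, not reallocated, between launches.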

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a critical performance bug with a well-understood solution
  • The fix is straightforward and follows CUDA best practices by moving memory allocation out of the hot path. The change is minimal, focused, and directly addresses the root cause of the performance regression described in issue #2558 (NVFP4 performance regression in TE main branch). The workspace allocation strategy is sound: a small 4-byte buffer allocated once per call and properly validated before use.
  • No files require special attention

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/hadamard_transform/group_row_cast_col_hadamard_transform_cast_fusion.cu | Changed from per-call cudaMallocAsync/cudaFreeAsync to a caller-provided workspace, eliminating the performance bottleneck from repeated allocations |
| transformer_engine/common/include/transformer_engine/hadamard_transform.h | Added a quant_workspace parameter to the function signature to support external workspace allocation |
| transformer_engine/pytorch/csrc/extensions/cast.cpp | Allocates the 4-byte tile scheduler workspace once per call and passes it down to the kernel launch |
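
As a rough illustration of the cast.cpp side (everything except at::empty and the function names quoted above is a hypothetical stand-in), the extension can take the tiny buffer from PyTorch's caching allocator, which avoids a real cudaMalloc on the steady-state path:

```cpp
#include <ATen/ATen.h>
#include <cstdint>

// Hypothetical sketch of the caller side: allocate the tiny scratch buffer
// through the PyTorch caching allocator (cheap after warm-up), then pass its
// pointer down the call chain instead of letting the launcher allocate it.
void* make_quant_workspace(at::Tensor& holder) {
  holder = at::empty({static_cast<int64_t>(sizeof(uint32_t))},
                     at::TensorOptions().dtype(at::kByte).device(at::kCUDA));
  return holder.data_ptr();  // handed down to the NVTE call as the workspace
}
```

The at::Tensor must outlive the NVTE call so the pointer stays valid; in the actual extension the buffer is presumably wrapped as the quant_workspace tensor passed to nvte_group_hadamard_transform_cast_fusion.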

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as PyTorch Extension
    participant API as nvte_group_hadamard_transform_cast_fusion
    participant Core as group_hadamard_transform_cast_fusion
    participant Kernel as group_row_col_rht_gemm_ntt_w_sfc
    participant GPU as CUDA Kernel

    Note over Caller: BEFORE (Performance Issue)
    Caller->>API: Call without workspace
    API->>Core: Forward call (no workspace)
    Core->>Kernel: Launch kernel
    Note over Kernel: cudaMallocAsync (slow!)
    Kernel->>GPU: memset workspace to 0
    Kernel->>GPU: Launch CUDA kernel
    Note over GPU: Execute computation
    Note over Kernel: cudaFreeAsync (slow!)
    Kernel-->>Caller: Return

    Note over Caller: AFTER (This Fix)
    Caller->>Caller: Allocate 4-byte workspace once
    Caller->>API: Call with workspace parameter
    API->>Core: Forward call with workspace
    Core->>Core: Extract workspace pointer
    Core->>Kernel: Pass workspace to kernel
    Note over Kernel: memset workspace to 0 (reuse!)
    Kernel->>GPU: Launch CUDA kernel
    Note over GPU: Execute computation
    Kernel-->>Caller: Return (no deallocation needed)
```

@zhongbozhu (Collaborator, Author) commented:

/te-ci L1

ksivaman previously approved these changes Jan 6, 2026

@ksivaman (Member) left a comment

There seem to be a lot of CI failures, ptal @zhongbozhu

@zhongbozhu (Collaborator, Author) commented:

@ksivaman I don't see CI failures on my side; maybe GitHub isn't being updated properly with the CI status?

```cpp
/*args=*/kernel_args,
/*rng_state=*/rng_state, /*sm_count=*/sm_count,
/*rng_state=*/rng_state,
/*tile_scheduler_workspace=*/tile_scheduler_workspace,
```
A reviewer (Member) commented:

I would prefer a more generic workspace name, to be honest. Proper handling of this would also require a function that returns the size of the required workspace.

The author (Collaborator) replied:

At the API level, it's called quant_workspace now.

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

Comment on lines +1403 to +1404

```cpp
NVTE_CHECK(quant_workspace.data.buffer_size_bytes() >= sizeof(uint32_t),
           "Quantization workspace must be at least 4 bytes.");
```
@timmoon10 (Collaborator) commented Jan 6, 2026

If we wanted to be fancy, we could add an option to query the workspace size, similar to how we do it for LayerNorm. If the workspace is not provided, we set the NVTETensor with the required size. This way the caller doesn't need to know the details of the workspace size.

That said, I think this approach is fine for now.
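
For reference, a minimal sketch of the query pattern being described, with purely illustrative names (the Workspace struct below stands in for the real NVTETensor handling):

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative two-phase workspace query, in the spirit of the LayerNorm-style
// handling mentioned above. Not the real NVTE API.
struct Workspace {
  void* data = nullptr;
  size_t size_bytes = 0;
};

void fused_cast_with_query(Workspace& quant_workspace /*, ...real args elided... */) {
  constexpr size_t kRequired = sizeof(uint32_t);  // tile-scheduler counter
  if (quant_workspace.data == nullptr) {
    quant_workspace.size_bytes = kRequired;  // query phase: just report the size
    return;
  }
  // Execution phase: use the caller-provided buffer.
  // ... zero the buffer and launch the kernel ...
}
```

This keeps the required size an implementation detail of the library; the merged PR instead enforces the 4-byte minimum up front with the NVTE_CHECK shown above.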

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
@timmoon10 (Collaborator) left a comment

LGTM, pending CI

@timmoon10 (Collaborator) commented:

/te-ci

@timmoon10 merged commit de51c96 into NVIDIA:main Jan 7, 2026
40 of 42 checks passed

Labels

bug (Something isn't working), fp4, MoE
