[NVFP4][MOE] Bug Fix for NVFP4 Grouped Quant #2564
Conversation
Greptile Summary: Fixed a critical NVFP4 performance regression by eliminating the repeated cudaMallocAsync/cudaFreeAsync workspace allocations in the grouped quantization path; the caller now allocates a small workspace once and passes it in.
This aligns with best practices for CUDA performance optimization by avoiding synchronous allocation operations in frequently called kernel functions. Confidence Score: 5/5
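To make the regression concrete, here is a minimal CUDA sketch of the pattern being fixed; `launch_before`/`launch_after` are hypothetical names inferred from the diagram below, not the actual TE launcher code.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// BEFORE pattern (hypothetical sketch): a 4-byte tile-scheduler counter is
// allocated and freed on every call, adding stream-ordered allocation
// overhead to a hot path.
void launch_before(cudaStream_t stream) {
  uint32_t* counter = nullptr;
  cudaMallocAsync(reinterpret_cast<void**>(&counter), sizeof(uint32_t),
                  stream);                               // slow, every call
  cudaMemsetAsync(counter, 0, sizeof(uint32_t), stream);
  // ... launch grouped-quant kernel using `counter` ...
  cudaFreeAsync(counter, stream);                        // slow, every call
}

// AFTER pattern: the caller allocates the counter once and passes it in;
// the launcher only zeroes it before reuse.
void launch_after(uint32_t* workspace, cudaStream_t stream) {
  cudaMemsetAsync(workspace, 0, sizeof(uint32_t), stream);
  // ... launch grouped-quant kernel using `workspace` ...
}
```

The before variant pays a stream-ordered allocation and free on every invocation; the after variant reduces the per-call cost to a 4-byte memset.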
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as PyTorch Extension
    participant API as nvte_group_hadamard_transform_cast_fusion
    participant Core as group_hadamard_transform_cast_fusion
    participant Kernel as group_row_col_rht_gemm_ntt_w_sfc
    participant GPU as CUDA Kernel
    Note over Caller: BEFORE (Performance Issue)
    Caller->>API: Call without workspace
    API->>Core: Forward call (no workspace)
    Core->>Kernel: Launch kernel
    Note over Kernel: cudaMallocAsync (slow!)
    Kernel->>GPU: memset workspace to 0
    Kernel->>GPU: Launch CUDA kernel
    Note over GPU: Execute computation
    Note over Kernel: cudaFreeAsync (slow!)
    Kernel-->>Caller: Return
    Note over Caller: AFTER (This Fix)
    Caller->>Caller: Allocate 4-byte workspace once
    Caller->>API: Call with workspace parameter
    API->>Core: Forward call with workspace
    Core->>Core: Extract workspace pointer
    Core->>Kernel: Pass workspace to kernel
    Note over Kernel: memset workspace to 0 (reuse!)
    Kernel->>GPU: Launch CUDA kernel
    Note over GPU: Execute computation
    Kernel-->>Caller: Return (no deallocation needed)
```
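On the caller side, the AFTER path in the diagram amounts to allocating a tiny CUDA buffer once and reusing it. A libtorch sketch of that idea (the helper name and caching strategy are hypothetical; the actual extension code may differ):

```cpp
#include <torch/torch.h>

// Hypothetical caller-side helper: allocate the 4-byte quant workspace once
// and reuse it across grouped-quant calls, as in the AFTER path above.
torch::Tensor& get_quant_workspace() {
  static torch::Tensor ws = torch::zeros(
      {1}, torch::dtype(torch::kInt32).device(torch::kCUDA));  // 4 bytes
  return ws;
}
```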
/te-ci L1
ksivaman left a comment:
There seem to be a lot of CI failures, ptal @zhongbozhu
@ksivaman I don't see CI failures on my side; maybe GitHub isn't being updated properly with the CI status?
```diff
   /*args=*/kernel_args,
-  /*rng_state=*/rng_state, /*sm_count=*/sm_count,
+  /*rng_state=*/rng_state,
+  /*tile_scheduler_workspace=*/tile_scheduler_workspace,
```
I would prefer a more generic workspace name, to be honest. Proper handling of this would also require a function that returns the size of the required workspace.
At the API level, it's called quant_workspace now.
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
transformer_engine/common/include/transformer_engine/hadamard_transform.h (outdated; resolved)
```cpp
NVTE_CHECK(quant_workspace.data.buffer_size_bytes() >= sizeof(uint32_t),
           "Quantization workspace must be at least 4 bytes.");
```
If we wanted to be fancy, we could add an option to query the workspace size, similar to how we do it for LayerNorm. If the workspace is not provided, we set the NVTETensor with the required size. This way the caller doesn't need to know the details of the workspace size.
That said, I think this approach is fine for now.
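For reference, the LayerNorm-style pattern mentioned here is the usual two-call workspace query: call once with an empty workspace to learn the required size, allocate, then call again. A sketch with hypothetical names (the real TE API differs in detail):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical two-call workspace query, in the spirit of the suggestion:
// with a null workspace pointer the op reports the size it needs instead
// of running; otherwise it validates the buffer and executes.
struct Workspace {
  void* ptr = nullptr;
  std::size_t size_bytes = 0;
};

bool grouped_quant(/* inputs elided */ Workspace* ws) {
  const std::size_t required = sizeof(std::uint32_t);  // tile-scheduler counter
  if (ws->ptr == nullptr) {
    ws->size_bytes = required;  // query mode: report size, do no work
    return false;
  }
  // Execute mode: the caller has guaranteed ws->size_bytes >= required.
  // ... zero the counter and launch the kernel ...
  return true;
}
```

This way the caller never needs to hard-code the 4-byte requirement.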
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
timmoon10 left a comment:
LGTM, pending CI
/te-ci
Description
Fixes #2558.
Type of change
Bug fix (non-breaking change which fixes an issue)
Changes
Please list the changes introduced in this PR:
- Add a quant_workspace argument to nvte_group_hadamard_transform_cast_fusion so the caller allocates the 4-byte tile-scheduler workspace once, instead of the launcher calling cudaMallocAsync/cudaFreeAsync on every invocation.
- Validate the provided workspace size with NVTE_CHECK and zero it before each kernel launch.