fixes for fused moe (qwen3.6, GLM5.1) + MSE calibration #1382
Conversation
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
📝 Walkthrough

This PR improves mixture-of-experts (MoE) quantization safety and calibration, introduces per-expert weight calibration support, adds a new NVFP4 experts-only PTQ recipe, and refines LLM PTQ preview input handling. It includes FP8 scale underflow prevention, safer CUDA tensor handling during MoE export, per-expert calibration discovery, and display formatting updates.

Changes
- MoE Quantization Infrastructure & Experts-Only PTQ Recipe
- LLM PTQ Preview Input Handling
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 5 | ❌ 1
❌ Failed checks (1 inconclusive)
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/moe_utils.py`:
- Around line 98-103: The temporary mutation of w_quantizer_src._amax before
calling copy.deepcopy may leave the source quantizer with _amax == None if
deepcopy raises; change the code around copy.deepcopy(w_quantizer_src) to save
_saved_amax, set w_quantizer_src._amax = None, then perform deepcopy inside a
try block and restore w_quantizer_src._amax = _saved_amax in a finally block;
after deepcopy set w_quantizer._amax = gu_amax_cpu as before so the source state
is always restored even on exceptions.
📒 Files selected for processing (7)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/moe_utils.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/model_quant.py
- modelopt/torch/quantization/nn/modules/tensor_quantizer.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse.yaml
if is_gate_up:
    _saved_amax = getattr(w_quantizer_src, "_amax", None)
    w_quantizer_src._amax = None
    w_quantizer = copy.deepcopy(w_quantizer_src)
    w_quantizer_src._amax = _saved_amax
    w_quantizer._amax = gu_amax_cpu
Protect temporary _amax mutation with try/finally.
If copy.deepcopy() raises, _amax is left as None on the source quantizer. Wrap restore in finally to avoid state corruption on failure.
Proposed fix
  if is_gate_up:
      _saved_amax = getattr(w_quantizer_src, "_amax", None)
-     w_quantizer_src._amax = None
-     w_quantizer = copy.deepcopy(w_quantizer_src)
-     w_quantizer_src._amax = _saved_amax
+     w_quantizer_src._amax = None
+     try:
+         w_quantizer = copy.deepcopy(w_quantizer_src)
+     finally:
+         w_quantizer_src._amax = _saved_amax
      w_quantizer._amax = gu_amax_cpu
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
This PR fixes several real bugs in the fused MoE quantization pipeline (MSE calibration discovery, FP8 scale underflow, uncalibrated expert export). The fixes are well-described in the PR body and address genuine correctness issues. However, there are several concerns:
- Missing unit tests (critical): No tests are added for any of the bug fixes. The existing test_fused_experts.py covers registration/conversion/basic export but doesn't exercise MSE calibration for fused experts, FP8 scale clamping, or the invalid-amax patching logic. Given the complexity of the moe_utils.py changes and the project's known pattern of missing tests, this is a blocking concern.
- Threshold inconsistency: _MIN_VALID_AMAX = 1e-4 is below the FP8 E4M3FN minimum (2^-9 ≈ 0.00195), meaning values between 1e-4 and 2e-3 pass the validity check but could still underflow.
- Hardcoded block_size=16: The fallback per-block amax computation in moe_utils.py hardcodes 16. If the actual block size differs, the shape will be wrong.
- Copyright year: The new YAML file has Copyright (c) 2024 but LICENSE_HEADER says 2026.
    )

    # If the weight quantizer was never calibrated, compute amax from weights.
    # Patch invalid per-block amax entries (NaN/inf/negative/zero/too-small/too-large)
Bot comment.
Bug: _MIN_VALID_AMAX = 1e-4 is below the FP8 E4M3FN minimum subnormal (2^-9 ≈ 0.00195). Values between 1e-4 and ~0.00195 will pass this validity check but will still underflow to 0 when cast to FP8 E4M3FN. Consider using 2e-3 (which you already use for clamping) or 2**-9 as the minimum valid threshold for consistency with the nvfp4_tensor.py fix.
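A minimal sketch of the suggested alignment; the constant names mirror the PR's _MIN_VALID_AMAX, but the helper and its surrounding code are assumptions, not the actual moe_utils.py implementation:

```python
import torch

FP8_E4M3_MIN_SUBNORMAL = 2.0 ** -9  # ≈ 0.00195, smallest positive FP8 E4M3FN value
_MIN_VALID_AMAX = FP8_E4M3_MIN_SUBNORMAL  # instead of 1e-4


def valid_amax_mask(per_block_amax: torch.Tensor) -> torch.Tensor:
    """Flag per-block amax entries that survive an FP8 E4M3FN cast without underflowing."""
    return torch.isfinite(per_block_amax) & (per_block_amax >= _MIN_VALID_AMAX)
```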
    )
    if invalid_mask.any():
        per_block_fallback = (
            weight_slice.detach()
Bot comment.
Hardcoded block_size=16 here and at line 173. If the quantizer's actual block size is different, the reshape will produce an incorrect shape. Consider extracting the block size from the weight quantizer (e.g. w_quantizer.block_sizes.get(-1, 16)) rather than hardcoding.
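A sketch of that extraction, assuming the quantizer exposes block_sizes as a dict keyed by axis; the exact attribute layout and helper name are assumptions:

```python
import torch


def per_block_amax_fallback(weight_slice: torch.Tensor, w_quantizer) -> torch.Tensor:
    """Compute a per-block amax using the quantizer's own block size when available."""
    block_size = 16  # fall back to 16 only if the quantizer exposes no block size
    block_sizes = getattr(w_quantizer, "block_sizes", None)
    if isinstance(block_sizes, dict):
        block_size = block_sizes.get(-1, 16)
    return (
        weight_slice.detach().abs().reshape(weight_slice.shape[0], -1, block_size).amax(dim=-1)
    )
```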
        torch.cuda.synchronize(amax.device)
        return amax.detach().cpu().float()
    except Exception:
        return None
Bot comment.
Bare except Exception silently swallows all errors and returns None. While defensive coding for corrupt CUDA tensors is reasonable, this could mask unrelated bugs. Consider catching a narrower set of exceptions (e.g. RuntimeError) or at minimum logging a warning when the fallback is triggered:
except RuntimeError:
    warnings.warn(f"Failed to extract _amax to CPU for {quantizer_src}, using fallback")
    return None

    # the corrupt CUDA tensor, then inject the pre-extracted CPU amax.
    if is_gate_up:
        _saved_amax = getattr(w_quantizer_src, "_amax", None)
        w_quantizer_src._amax = None
Bot comment.
For down_proj (non-gate_up), w_quantizer = w_quantizer_src — this is the original quantizer, not a copy. Then w_quantizer._amax = down_amax_cpu mutates the original quantizer's _amax. This is fine if the module is only exported once, but is potentially surprising. A comment noting this is intentional mutation of the original would help.
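One way to make that explicit is a short comment at the assignment, roughly as below; variable names are taken from this comment and the surrounding code is assumed:

```python
# down_proj path: reuse the source quantizer directly rather than deepcopying it.
# NOTE: assigning _amax below intentionally mutates the original module's quantizer
# state; this relies on the module being exported exactly once.
w_quantizer = w_quantizer_src
w_quantizer._amax = down_amax_cpu
```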
@@ -0,0 +1,130 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Bot comment.
Copyright year is 2024 but the project's LICENSE_HEADER specifies 2026. New files should use the current year from the canonical header.
    # per-expert quantizer individually.
    for param_name, _ in parent_module.named_parameters(recurse=False):
        qlist = getattr(parent_module, f"{param_name}_weight_quantizers", None)
        if not isinstance(qlist, nn.ModuleList):
Bot comment.
The pattern named_parameters(recurse=False) + checking for f"{param_name}_weight_quantizers" works for the current _QuantFusedExperts layout, but is fairly fragile. If other modules happen to have a parameter and a same-named ModuleList with _weight_quantizers suffix, they'd be picked up too. Consider adding a type check (e.g. checking if parent_module is a _QuantFusedExperts instance) or at least a comment noting the assumption.
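A sketch of such a guard; checking the class name avoids committing to an import location for _QuantFusedExperts, which is an assumption here (the real code could use isinstance against the actual class):

```python
import torch.nn as nn


def is_fused_experts_module(module: nn.Module) -> bool:
    """Heuristic type check so only _QuantFusedExperts-style modules are scanned."""
    return type(module).__name__ == "_QuantFusedExperts"
```

Placed before the named_parameters loop, `if not is_fused_experts_module(parent_module): continue` would skip modules that merely mimic the `{param}_weight_quantizers` layout.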
| "input_features" if model_type == "whisper" else "input_ids" | ||
| ][0:1] | ||
| # Strip leading padding tokens so the preview input shows real content | ||
| if model_type not in ("whisper",) and tokenizer is not None and tokenizer.pad_token_id is not None: |
Bot comment.
Minor: if preview_input_ids has no non-pad tokens (e.g. all tokens are padding), first_non_pad will be empty and first_non_pad[0] will error. The first_non_pad.numel() > 0 check correctly guards this — just confirming it's intentional that the original (all-padding) input is preserved in that edge case.
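A minimal sketch of that guarded strip, assuming preview_input_ids has shape (1, seq_len); the surrounding hf_ptq.py code is not reproduced exactly:

```python
import torch


def strip_leading_padding(preview_input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Drop leading pad tokens from a (1, seq_len) preview batch; keep it if all-padding."""
    first_non_pad = (preview_input_ids[0] != pad_token_id).nonzero(as_tuple=True)[0]
    if first_non_pad.numel() > 0:
        return preview_input_ids[:, first_non_pad[0]:]
    return preview_input_ids  # all-padding edge case: preserve the original input
```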
What does this PR do?
Type of change: Bug fix
Fixes several issues with NVFP4 MSE calibration and export for fused MoE expert modules (_QuantFusedExperts — used by Qwen3.6, GLM-5.1, and other HF transformers 5.0+ models that store expert weights as 3-D nn.Parameters).
The weight-quantizer discovery loop in mse_calibrate used the singular attribute name gate_up_proj_weight_quantizer to look up quantizers, but _QuantFusedExperts stores them in a plural nn.ModuleList named gate_up_proj_weight_quantizers. All 20,480 expert quantizers were silently skipped, resulting in "MSE weight calibration: 0it" and no MSE-optimized scales.
Fix: add a second pass that detects plural {param}_weight_quantizers ModuleLists and enqueues each per-expert quantizer with a (param_name, expert_idx) tuple; step 3 unpacks the tuple to extract the per-expert weight slice.
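A hedged sketch of that second discovery pass; the real mse_calibrate internals differ, and the function name below is illustrative:

```python
import torch.nn as nn


def collect_fused_expert_quantizers(model: nn.Module):
    """Yield (module, weight_quantizer, (param_name, expert_idx)) for fused experts.

    Detects plural {param}_weight_quantizers ModuleLists so a later step can slice
    the matching expert out of the 3-D weight parameter.
    """
    for _, module in model.named_modules():
        for param_name, _ in module.named_parameters(recurse=False):
            qlist = getattr(module, f"{param_name}_weight_quantizers", None)
            if not isinstance(qlist, nn.ModuleList):
                continue
            for expert_idx, w_quantizer in enumerate(qlist):
                yield module, w_quantizer, (param_name, expert_idx)
```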
Per-block weight scales can silently underflow to 0 when cast to FP8 E4M3FN. The existing scale == 0 guard only catches exact float32 zeros; values in (0, 2^-9) pass through and become 0 after the FP8 cast. This affects both the dynamic recompute path (get_weights_scaling_factor) and the static calibrated path (get_weights_scaling_factor_from_quantizer).
Fix: clamp per-block scales to 2^-9 (smallest positive FP8 E4M3FN subnormal) before the FP8 cast in both paths.
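A simplified sketch of the clamp; the real helpers are get_weights_scaling_factor and get_weights_scaling_factor_from_quantizer in nvfp4_tensor.py, which this does not reproduce:

```python
import torch

FP8_E4M3_MIN = 2.0 ** -9  # smallest positive FP8 E4M3FN subnormal (≈ 0.00195)


def fp8_block_scales(per_block_scale: torch.Tensor) -> torch.Tensor:
    """Clamp per-block scales before the FP8 cast so none underflow to zero."""
    return per_block_scale.clamp(min=FP8_E4M3_MIN).to(torch.float8_e4m3fn)
```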
Experts that receive no tokens during calibration have _amax = 0 or uninitialized values. The existing scalar fallback used 1e-4 which itself underflows to 0 in FP8 E4M3FN (1e-4 < 2^-9 ≈ 0.00195). Additionally, the per-block fallback tensor had shape (H*W, 1) instead of (H, W), causing a shape mismatch that silently bypassed the fallback and fell through to the bad scalar. Finally, a stale zero global_amax from an uncalibrated expert was not recomputed, causing division-by-zero in the FP8 scale formula.
Fix: reshape the per-block fallback correctly; raise the clamp floor to 2e-3; always recompute global_amax from the current (possibly patched) per-block _amax.
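A sketch of the patched fallback for an uncalibrated expert, assuming a 2-D weight slice of shape (H, W) and a per-block amax of shape (H, W // block_size); names are illustrative:

```python
import torch


def patched_amax_fallback(weight_slice: torch.Tensor, block_size: int = 16):
    """Recompute per-block and global amax for an uncalibrated expert weight slice."""
    per_block_amax = (
        weight_slice.detach()
        .abs()
        .reshape(weight_slice.shape[0], -1, block_size)  # (H, W // block_size, block_size)
        .amax(dim=-1)                                    # (H, W // block_size), not (H*W, 1)
        .clamp(min=2e-3)                                 # floor above the FP8 E4M3FN minimum
    )
    global_amax = per_block_amax.max()  # always recomputed from the patched per-block amax
    return per_block_amax, global_amax
```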
Additional fixes:
Usage
Testing
validated on Qwen3.6-35B-A3B (8× B200):
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Bug Fixes
Improvements