Support Mixed precision & Static MSE PTQ in MCore export; Nemotron Super v3 NVFP4 recipe #1363
Conversation
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be changed in the review settings.
📝 Walkthrough
This PR adds FP8 scale sweep stride control to calibration workflows, introduces three mixed-precision NVFP4 quantization recipes for Nemotron-3-Super-120B with different calibration methods (MSE, MSE with stride-4 sweep, max-based), refactors MoE calibration completeness checks to recursively traverse SequentialQuantizer leaves, and overhauls HuggingFace export to collect and apply per-layer quantization metadata.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 passed | ❌ 1 failed (1 warning)
✅ Passed checks (5 passed)
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 29-32: The metadata claims attention o_proj is FP8 per-tensor but
quant_cfg lacks any override for attention.o_proj; either update the description
or add the missing quantizer mapping. Fix by adding an explicit quant_cfg
override for the attention o_proj parameter name (e.g., attention.o_proj /
attention.output_projection / whatever key is used in your model mapping) to use
the FP8 per-tensor quantizer used elsewhere, or remove the attention o_proj
mention from the metadata so it matches the existing quant_cfg; ensure you
reference the exact layer key used in quant_cfg to keep mapping consistent with
the model.
- Around line 94-115: The entries using broad quantizer_name patterns
('*mixer.fc1_latent_proj*weight_quantizer',
'*mixer.fc1_latent_proj*input_quantizer',
'*mixer.fc2_latent_proj*weight_quantizer',
'*mixer.fc2_latent_proj*input_quantizer') are enabling FP8 for every latent
projection instead of only layers 1, 3, and 5; update these quantizer_name
values to target only the specific layer instances (e.g. include the layer
index/identifier for layers 1, 3, and 5 in the wildcard or use a regex/explicit
list) so only those mixer.fc1_latent_proj and mixer.fc2_latent_proj quantizers
are set to num_bits: e4m3, and leave all other latent projection quantizers at
BF16 (or remove the generic entries).
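The scoping problem described above can be sketched with Python's `fnmatch`, which behaves like the wildcard matching typically used for `quantizer_name` patterns. The layer names below are hypothetical stand-ins, not taken from the actual model mapping:

```python
from fnmatch import fnmatch

# Hypothetical quantizer names for six decoder layers; real names depend on the model.
names = [f"model.layers.{i}.mixer.fc1_latent_proj.weight_quantizer" for i in range(6)]

broad = "*mixer.fc1_latent_proj*weight_quantizer"  # matches every layer
scoped = [f"*layers.{i}.mixer.fc1_latent_proj*weight_quantizer" for i in (1, 3, 5)]

broad_hits = [n for n in names if fnmatch(n, broad)]
scoped_hits = [n for n in names if any(fnmatch(n, p) for p in scoped)]

print(len(broad_hits), len(scoped_hits))  # broad enables all 6; scoped only layers 1, 3, 5
```

This is why the generic patterns flip every latent projection to FP8: the leading `*` swallows the layer index, so only layer-qualified wildcards (or an explicit list) restrict the override to layers 1, 3, and 5.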
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 60ed9c0b-efef-4967-b321-7270a8853455
📒 Files selected for processing (1)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
Codecov Report ❌. Additional details and impacted files:
@@ Coverage Diff @@
##             main    #1363       +/-   ##
===========================================
- Coverage   76.93%   58.90%   -18.03%
===========================================
  Files         471      471
  Lines       50404    53013     +2609
===========================================
- Hits        38776    31227     -7549
- Misses      11628    21786    +10158
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/model_calib.py (1)
389-412: ⚠️ Potential issue | 🟡 Minor
Custom FP8-sweep backends still ignore the new stride setting.
When a registered `backend_factory` is used, this branch still calls the old 3-argument factory signature, so `fp8_scale_sweep_stride` only takes effect on the built-in `NVFP4MSECalibrator` path below. That makes the new config a silent no-op for registry-backed sweep calibrators. Please extend the factory contract or reject a non-default stride explicitly in the backend path.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 389 - 412, The registered backend factory path (lookup via _FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls the factory with the old 3-argument signature and thus ignores fp8_scale_sweep_stride; update the branch that sets module._calibrator via backend_factory to either (A) call the factory with the new argument (pass fp8_scale_sweep_stride) and update the factory contract accordingly, or (B) explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a clear error so users know registry-backed calibrators do not support stride; ensure the call still passes initial_amax, module._calibrator._axis, partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
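A minimal standalone sketch of the two options in the prompt above. The wrapper and factory names here are hypothetical (only `fp8_scale_sweep_stride` comes from the review text); this is not the repository's actual dispatch code:

```python
import inspect

def build_sweep_calibrator(backend_factory, initial_amax, axis, quant_func,
                           fp8_scale_sweep_stride=1):
    """Hypothetical dispatch: pass the stride to factories that accept it (option A),
    otherwise reject a non-default stride instead of silently ignoring it (option B)."""
    if "fp8_scale_sweep_stride" in inspect.signature(backend_factory).parameters:
        # Option A: extended factory contract, stride is forwarded.
        return backend_factory(initial_amax, axis, quant_func,
                               fp8_scale_sweep_stride=fp8_scale_sweep_stride)
    if fp8_scale_sweep_stride != 1:
        # Option B: fail loudly so the config is never a silent no-op.
        raise ValueError(
            f"registry-backed sweep calibrator does not support "
            f"fp8_scale_sweep_stride={fp8_scale_sweep_stride}"
        )
    return backend_factory(initial_amax, axis, quant_func)
```

Either branch keeps legacy 3-argument factories working while making a non-default stride either effective or an explicit error.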
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
41-42: ⚠️ Potential issue | 🟡 Minor
The metadata description still overstates what this recipe quantizes.
These lines say attention `o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but there are no matching overrides in `quant_cfg`, and the header comments say the latent MoE projections stay BF16. Please update the description so it matches the actual recipe.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 41 - 42: The metadata description in the YAML (the description field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and "fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching overrides in quant_cfg and header comments state latent MoE projections remain BF16/FP16; update the description text to accurately reflect the recipe by removing or changing the FP8 claims (e.g., state that latent MoE projections and those specific projections remain BF16/FP16 and only list the layers that the quant_cfg actually overrides as FP8 per-tensor).
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor
The metadata description does not match the actual quantizer mapping.
These lines still claim attention `o_proj`, `fc1_latent_proj`, and `fc2_latent_proj` are FP8 per-tensor, but this recipe never enables those quantizers, and the header comments above say latent MoE stays BF16. Please align the description with the `quant_cfg` that follows.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The YAML description string claims that "attention o_proj, fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in this recipe does not enable quantizers for "attention.o_proj", "fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays BF16/FP16); fix this by either updating the description to state those projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify quant_cfg to actually enable FP8 per-tensor quantizers for the keys "attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment matches the mapping.
🧹 Nitpick comments (1)
modelopt/torch/quantization/calib/mse.py (1)
202-206: Add a regression test for strided FP8 candidate generation.
The existing coverage only exercises the default 126-candidate path. This branch adds two new behaviors (subsampling and forced inclusion of the last candidate), so it should have a focused test for `fp8_scale_sweep_stride > 1` to lock down both the reduced candidate count and preservation of the max scale.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/calib/mse.py` around lines 202 - 206, The strided FP8 candidate-generation branch guarded by fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into candidates and appends the last value) is untested; add a focused regression test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and asserts that the resulting candidates length is reduced according to the stride and that the final element equals the original fp8_values[-1] (verifying forced inclusion of the max scale); ensure the test covers both subsampling and the append behavior so the branch is locked down.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 16-28: Remove the unresolved Git merge conflict markers (<<<<<<<,
=======, >>>>>>>) and restore a single coherent comment block describing the
quant config; keep the more detailed version that lists both HF and
Megatron-Core names (the lines mentioning mixer.experts.<N>.{up,down}_proj,
mlp.experts.local_experts.<N>.linear_fc{1,2},
mixer.shared_experts.{up,down}_proj, and mlp.shared_experts.linear_fc{1,2}) or
merge its additional details into the shorter variant so the YAML comment is
valid and free of conflict markers.
---
Outside diff comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around line 389-412: The registered backend factory path (lookup via
_FP8_SWEEP_CALIBRATOR_REGISTRY and assigned to backend_factory) currently calls
the factory with the old 3-argument signature and thus ignores
fp8_scale_sweep_stride; update the branch that sets module._calibrator via
backend_factory to either (A) call the factory with the new argument (pass
fp8_scale_sweep_stride) and update the factory contract accordingly, or (B)
explicitly detect a non-default fp8_scale_sweep_stride and raise/rollback with a
clear error so users know registry-backed calibrators do not support stride;
ensure the call still passes initial_amax, module._calibrator._axis,
partial(_mse_quant_func, quantizer=module) and include fp8_scale_sweep_stride
when choosing option A, mirroring how NVFP4MSECalibrator is constructed.
---
Duplicate comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The YAML description string claims that "attention o_proj,
fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in
this recipe does not enable quantizers for "attention.o_proj",
"fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays
BF16/FP16); fix this by either updating the description to state those
projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify
quant_cfg to actually enable FP8 per-tensor quantizers for the keys
"attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment
matches the mapping.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 41-42: The metadata description in the YAML (the description
field) incorrectly claims that "attention o_proj", "fc1_latent_proj", and
"fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching
overrides in quant_cfg and header comments state latent MoE projections remain
BF16/FP16; update the description text to accurately reflect the recipe by
removing or changing the FP8 claims (e.g., state that latent MoE projections and
those specific projections remain BF16/FP16 and only list the layers that the
quant_cfg actually overrides as FP8 per-tensor).
---
Nitpick comments:
In `@modelopt/torch/quantization/calib/mse.py`:
- Around line 202-206: The strided FP8 candidate-generation branch guarded by
fp8_scale_sweep_stride > 1 (in the block that subsamples fp8_values into
candidates and appends the last value) is untested; add a focused regression
test (e.g., test_fp8_scale_sweep_stride_preserves_last_candidate) that sets
fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and
asserts that the resulting candidates length is reduced according to the stride
and that the final element equals the original fp8_values[-1] (verifying forced
inclusion of the max scale); ensure the test covers both subsampling and the
append behavior so the branch is locked down.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0d72ce13-180e-46ea-b4d8-8c6c140d22a7
📒 Files selected for processing (5)
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)
35-36: ⚠️ Potential issue | 🟡 Minor
Metadata precision mapping is inconsistent with the actual recipe.
Line 35 says latent MoE `fc1_latent_proj`/`fc2_latent_proj` are FP8 per-tensor, but this file has no latent-MoE quantizer overrides and the header (Line 27) says they stay BF16. Please align the description with `quant_cfg` to avoid confusion.
Proposed patch
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with FP8 scale sweep.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around lines 35 - 36: The description's claim that latent MoE "fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the quant_cfg (and header) which leave those layers as BF16; either update the description string to state that fc1_latent_proj and fc2_latent_proj remain BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8 per-tensor (matching the rest of the FP8 settings); ensure the description text and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are aligned.
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)
34-35: ⚠️ Potential issue | 🟡 Minor
Description/comments conflict on latent MoE and SSM/mamba precision.
Lines 34-35 state latent MoE is FP8 per-tensor, but the recipe does not enable latent-MoE quantizers. Also, Lines 134-135 conflict with earlier comments on SSM/mamba precision. Please make these statements internally consistent.
Proposed patch
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with stride-4 FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with stride-4 FP8 scale sweep.
@@
-  # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-  # SSM state / mamba conv1d stay FP16.
+  # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head, mamba conv1d.
+  # SSM state stays FP32 (can be set to FP16 in vLLM).
Also applies to: 134-135
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml` around lines 34 - 35, The description text claims "latent MoE is FP8 per-tensor" and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by making the YAML fields match the prose: either add the latent-MoE quantizer key (e.g., include latent_moe in the quantizers list or set enable_latent_moe_quantizers: true) and ensure its precision is set to FP8 per-tensor, or change the description to remove the FP8 latent-MoE claim; likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba precision fields to uniformly state BF16/FP16 (or change the description to reflect the actual configured precisions) so the "description" string and the quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision, mamba_precision, lm_head_precision / mtp_precision) are consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description text claims "latent MoE is FP8 per-tensor"
and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable
latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by
making the YAML fields match the prose: either add the latent-MoE quantizer key
(e.g., include latent_moe in the quantizers list or set
enable_latent_moe_quantizers: true) and ensure its precision is set to FP8
per-tensor, or change the description to remove the FP8 latent-MoE claim;
likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba
precision fields to uniformly state BF16/FP16 (or change the description to
reflect the actual configured precisions) so the "description" string and the
quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision,
mamba_precision, lm_head_precision / mtp_precision) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description's claim that latent MoE
"fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the
quant_cfg (and header) which leave those layers as BF16; either update the
description string to state that fc1_latent_proj and fc2_latent_proj remain
BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg
for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8
per-tensor (matching the rest of the FP8 settings); ensure the description text
and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are
aligned.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 595b4531-3ced-4894-a5cc-c28f161c8f3e
📒 Files selected for processing (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
# Megatron-Core names: mlp.shared_experts.linear_fc{1,2}
# - Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
# - KV cache: FP8
# - Attention linears ({q,k,v}_proj): BF16 (not quantized)
Can we double check attention out linear? IIRC, attention o_proj should be FP8.
Responded in Slack: only 2/9 attention layers had o_proj FP8 in the final Super NVFP4 ckpt, but we can always add it later to test whether the accuracy degradation is minimal.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
/claude review

@coderabbitai full review

✅ Actions performed: full review triggered.

@claude review
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)
1213-1229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Record packed-expert quant metadata against the module prefix, not the weight key.
In both pack-remap helpers, `prefix` is the tensor key written into `state_dict` (for example `...weight`). Recording that verbatim produces `layer_config_dict` entries like `...weight.quantization`, and `process_layer_quant_config()` will then emit `quantized_layers` names ending in `.weight` instead of the HF module prefix. Serving-side layer matching will miss those packed experts.
Suggested fix
-        self._record_layer_quant_config(prefix, qformat, block_size)
+        module_prefix = prefix.rsplit(".", 1)[0] + "."
+        self._record_layer_quant_config(module_prefix, qformat, block_size)
Apply the same normalization in both `_pack_name_remapping()` and `_pack_name_remapping_gpt_oss()`.
Also applies to: 1280-1298
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/unified_export_megatron.py` around lines 1213 - 1229, The code records quant metadata under the full tensor key (e.g., "...weight") causing layer names to end with ".weight"; update both _pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix before calling self._record_layer_quant_config by stripping the tensor suffix (e.g., remove trailing ".weight" or the last dot component) so the module-level HF prefix is recorded instead of the weight key; apply the same normalization logic where self._record_layer_quant_config(prefix, qformat, block_size) is invoked in both functions.
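The normalization the comment asks for amounts to a single string operation; a minimal standalone sketch (the helper name and the sample keys are illustrative, not actual export keys):

```python
def module_prefix(tensor_key: str) -> str:
    """Strip the trailing tensor name ('weight' here) so quant metadata is recorded
    under the module prefix, e.g. '...experts.' rather than '...experts.weight'."""
    return tensor_key.rsplit(".", 1)[0] + "."

print(module_prefix("model.layers.0.mlp.experts.weight"))
# model.layers.0.mlp.experts.
```

Applying this before `self._record_layer_quant_config(...)` in both pack-remap helpers keeps the recorded names aligned with the HF module prefixes that serving-side matching expects.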
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml`:
- Around line 16-36: The recipe metadata/comments misstate Latent MOE behavior:
fc1_latent_proj/fc2_latent_proj are described as FP8 per-tensor but the
quant_cfg (and lines noting "stay BF16/FP16") never enable those quantizers;
either update the human-readable description to say Latent MOE projections
remain BF16 (or FP16 per existing comment) to match quant_cfg, or enable FP8
per-tensor quantizers for fc1_latent_proj and fc2_latent_proj in the quant_cfg
so the comment matches behavior—make the change by editing the
metadata/description block and/or the quant_cfg entries that reference
fc1_latent_proj and fc2_latent_proj accordingly.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description erroneously claims the Latent MoE is "FP8
per-tensor" while the recipe does not enable quantizer patterns for
fc1_latent_proj or fc2_latent_proj; update the metadata description string (the
YAML "description" field) to remove or correct the FP8 statement for Latent MoE
(or explicitly state that fc1_latent_proj/fc2_latent_proj remain BF16/FP16) so
it matches the actual quantizer configuration for fc1_latent_proj and
fc2_latent_proj.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 35-36: The description claims latent MoE is FP8 and that static
scales are "chosen by MSE", but the YAML sets no quantizers for
fc1_latent_proj/fc2_latent_proj and uses method: max; update the human-readable
description to match the actual config by either (A) enabling the latent
projection quantizers (fc1_latent_proj/fc2_latent_proj) to make latent MoE FP8,
or (B) explicitly state latent projections remain unquantized (not FP8) and
change the phrase "static scales are chosen by MSE" to reflect the configured
method (e.g., "static scales computed by max" or "method: max"); adjust the
description lines referencing "FP8 per-tensor; ... Latent MOE
fc1_latent_proj/fc2_latent_proj" and the sentence about static scale selection
accordingly so the text and the fields (fc1_latent_proj, fc2_latent_proj, and
method: max) are consistent.
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description string misreports the latent-MoE policy:
update the description (the YAML description field) to reflect that
fc1_latent_proj and fc2_latent_proj are left unquantized (per the quant_cfg)
rather than being "FP8 per-tensor"; edit the text around the FP8/KV/latent-MoE
sentence to explicitly state that fc1_latent_proj/fc2_latent_proj remain
unquantized (or their actual precision) and keep FP8 per-tensor/KV wording only
for the tensors that truly use FP8.
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 815-818: The current check only treats qformat is None as
excluded, but QUANTIZATION_NONE should be treated the same so those layers are
recorded as excluded (and not dropped from the exported quant config); update
the conditional in unified_export_megatron.py where qformat is obtained from
_get_quantization_format(module) to also consider QUANTIZATION_NONE (e.g., if
qformat is None or qformat == QUANTIZATION_NONE) before calling
_record_excluded_module(prefix), and ensure any references to
_record_layer_quant_config still see these explicitly disabled quantizers as
excluded rather than omitted.
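The suggested conditional is small; sketched standalone below, with `QUANTIZATION_NONE` as a placeholder value since the real constant lives in the export code:

```python
QUANTIZATION_NONE = "none"  # placeholder; the export code defines the actual constant

def is_excluded(qformat):
    # Treat an explicitly disabled quantizer the same as a missing one, so the
    # layer is recorded as excluded instead of silently dropped from the config.
    return qformat is None or qformat == QUANTIZATION_NONE
```

With this shape, both unquantized modules and explicitly disabled quantizers flow into `_record_excluded_module(prefix)` rather than being omitted from the exported quant config.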
---
Outside diff comments:
In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 1213-1229: The code records quant metadata under the full tensor
key (e.g., "...weight") causing layer names to end with ".weight"; update both
_pack_name_remapping and _pack_name_remapping_gpt_oss to normalize the prefix
before calling self._record_layer_quant_config by stripping the tensor suffix
(e.g., remove trailing ".weight" or the last dot component) so the module-level
HF prefix is recorded instead of the weight key; apply the same normalization
logic where self._record_layer_quant_config(prefix, qformat, block_size) is
invoked in both functions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3adef0ee-1161-4068-a796-fc857c2455fd
📒 Files selected for processing (11)
modelopt/torch/export/unified_export_megatron.py
modelopt/torch/quantization/calib/mse.py
modelopt/torch/quantization/config.py
modelopt/torch/quantization/model_calib.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/qtensor/nvfp4_tensor.py
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
tests/gpu_megatron/torch/export/test_unified_export_megatron.py
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@CodeRabbit full review

✅ Actions performed: full review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)
120-122: 💤 Low value
`_get_static_global_amax` is called twice for the static branch.
`_is_static_quantizer` internally calls `_get_static_global_amax` and discards the result; line 122 then calls it again. Consider hoisting to a single variable:
♻️ Proposed refactor
-        if cls._is_static_quantizer(weight_quantizer):
-            # Static path: use pre-computed per-block amax values from quantizer
-            global_amax = cls._get_static_global_amax(weight_quantizer).float()
+        global_amax = cls._get_static_global_amax(weight_quantizer)
+        if global_amax is not None:
+            # Static path: use pre-computed per-block amax values from quantizer
+            global_amax = global_amax.float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 120 - 122: Hoist the call to _get_static_global_amax so it's executed only once: call amax = cls._get_static_global_amax(weight_quantizer) (and .float() as needed) and reuse that value instead of calling _get_static_global_amax again; update _is_static_quantizer to accept an optional precomputed amax parameter or refactor _is_static_quantizer to avoid calling _get_static_global_amax internally so the static branch uses the hoisted amax (refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).
tests/gpu_megatron/torch/export/test_unified_export_megatron.py (1)
151-164: ⚡ Quick win
Add one end-to-end mixed-precision export case.
The new export path here is the per-layer `layer_config_dict`/`process_layer_quant_config` flow, but this matrix still only exercises uniform configs (`NVFP4_DEFAULT_CFG`/`FP8_DEFAULT_CFG`). A single mixed-recipe case would catch regressions in `quantized_layers`, excludes, and the `config.json` ↔ `hf_quant_config.json` parity you just tightened.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines 151 - 164, Add a new parametrized test case to the existing pytest.mark.parametrize matrix (the tuple of ("model_type", "arch", "extra_module", "quant_config", "kv_cache_quant_cfg")) to exercise the per-layer mixed-precision export path: supply a quant_config that triggers the layer_config_dict / process_layer_quant_config flow (i.e., a mixed-recipe config rather than NVFP4_DEFAULT_CFG or FP8_DEFAULT_CFG) so the test exercises quantized_layers, excludes handling, and the config.json ↔ hf_quant_config.json parity; update the matrix alongside existing entries (referencing the parameters model_type/arch/extra_module/quant_config/kv_cache_quant_cfg) so one case uses a mixed per-layer config for either nemotron or llama.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 45-79: The two routed-expert weight quantizer entries
('*mixer.experts.*weight_quantizer' and '*mlp.experts*weight_quantizer')
currently use type: dynamic; change them to use the static-weight calibration
path by replacing type: dynamic with type: static (or remove the dynamic setting
and ensure static is explicitly set) so routed-expert weights remain static in
this max-calib variant while leaving the corresponding input_quantizer entries
unchanged.
---
Nitpick comments:
In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py`:
- Around line 120-122: Hoist the call to _get_static_global_amax so it's
executed only once: call amax = cls._get_static_global_amax(weight_quantizer)
(and .float() as needed) and reuse that value instead of calling
_get_static_global_amax again; update _is_static_quantizer to accept an optional
precomputed amax parameter or refactor _is_static_quantizer to avoid calling
_get_static_global_amax internally so the static branch uses the hoisted amax
(refer to _is_static_quantizer, _get_static_global_amax and weight_quantizer).
In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py`:
- Around line 151-164: Add a new parametrized test case to the existing
pytest.mark.parametrize matrix (the tuple of ("model_type", "arch",
"extra_module", "quant_config", "kv_cache_quant_cfg")) to exercise the per-layer
mixed-precision export path: supply a quant_config that triggers the
layer_config_dict / process_layer_quant_config flow (i.e., a mixed-recipe config
rather than NVFP4_DEFAULT_CFG or FP8_DEFAULT_CFG) so the test exercises
quantized_layers, excludes handling, and the config.json ↔ hf_quant_config.json
parity; update the matrix alongside existing entries (referencing the parameters
model_type/arch/extra_module/quant_config/kv_cache_quant_cfg) so one case uses a
mixed per-layer config for either nemotron or llama.
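Sketched as plain data, the suggested widening of the test matrix might look like the snippet below. The config values are placeholders, not the repo's real objects; the point is simply that one row carries a per-layer (dict-shaped) config so the mixed-precision export path gets exercised:

```python
# Placeholders standing in for the real config objects in the test module.
NVFP4_DEFAULT_CFG = "nvfp4_default"
FP8_DEFAULT_CFG = "fp8_default"
MIXED_NVFP4_FP8_CFG = {  # hypothetical per-layer mixed recipe
    "*.mixer.experts.*": "nvfp4",
    "*.self_attention.*": "fp8",
}

# (model_type, arch, extra_module, quant_config, kv_cache_quant_cfg)
EXPORT_MATRIX = [
    ("llama", "LlamaForCausalLM", None, NVFP4_DEFAULT_CFG, None),
    ("llama", "LlamaForCausalLM", None, FP8_DEFAULT_CFG, None),
    # New row: triggers the layer_config_dict / process_layer_quant_config flow.
    ("nemotron", "NemotronHForCausalLM", None, MIXED_NVFP4_FP8_CFG, None),
]


def has_mixed_case(matrix):
    """True if at least one row carries a per-layer (dict) quant config."""
    return any(isinstance(row[3], dict) for row in matrix)
```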
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 94f3528f-2ee4-4757-b8de-3a323300ca74
📒 Files selected for processing (10)
- modelopt/torch/export/unified_export_megatron.py
- modelopt/torch/quantization/calib/mse.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/megatron.py
- modelopt/torch/quantization/qtensor/nvfp4_tensor.py
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
- modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
- tests/gpu_megatron/torch/export/test_unified_export_megatron.py
```yaml
# MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
# HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
- quantizer_name: '*mixer.experts.*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mixer.experts.*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
# Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
- quantizer_name: '*mlp.experts*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mlp.experts*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
```
Keep routed-expert weights static in the max-calib variant.
These two `*...weight_quantizer` entries are `type: dynamic`, so this recipe does more than swap `method: mse` for `method: max`; it changes routed-expert weights to dynamic NVFP4 and bypasses the static-weight calibration path entirely. That makes this variant a different quantization recipe, not a max-calibrated version of super-nvfp4.yaml.
Suggested fix

```diff
 - quantizer_name: '*mixer.experts.*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
-      type: dynamic
+      type: static
       scale_bits: e4m3
     num_bits: e2m1
@@
 - quantizer_name: '*mlp.experts*weight_quantizer'
   enable: true
   cfg:
     block_sizes:
-      type: dynamic
+      type: static
       scale_bits: e4m3
     num_bits: e2m1
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```yaml
# MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
# HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
- quantizer_name: '*mixer.experts.*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: static
      scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mixer.experts.*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
# Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
- quantizer_name: '*mlp.experts*weight_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: static
      scale_bits: e4m3
    num_bits: e2m1
- quantizer_name: '*mlp.experts*input_quantizer'
  enable: true
  cfg:
    block_sizes:
      -1: 16
      type: dynamic
      scale_bits: e4m3
    num_bits: e2m1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`
around lines 45 - 79, The two routed-expert weight quantizer entries
('*mixer.experts.*weight_quantizer' and '*mlp.experts*weight_quantizer')
currently use type: dynamic; change them to use the static-weight calibration
path by replacing type: dynamic with type: static (or remove the dynamic setting
and ensure static is explicitly set) so routed-expert weights remain static in
this max-calib variant while leaving the corresponding input_quantizer entries
unchanged.
```python
if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
```
do we change the format of hf_quant_config.json by this change?
What does this PR do?
Type of change: New recipe
- Recipes mirroring the published `hf_quant_config.json`
- Fix `_global_amax` field in `NVFP4QTensor` static quantizer detection -> fixes bug during MCore export for MSE
- Fix the dynamic check when `block_sizes` is dict-backed
- Add `fp8_scale_sweep_stride` to optionally subsample NVFP4 FP8 scale sweep candidates
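As a rough illustration of what a sweep stride does (the helper and candidate list below are made up for this sketch, not the actual calibration code):

```python
# Hypothetical sketch of stride-based subsampling of FP8 scale-sweep
# candidates; stride=1 preserves the exhaustive default behavior.
def sweep_candidates(candidates, stride=1):
    """Keep every `stride`-th candidate from the sweep list."""
    if stride <= 1:
        return list(candidates)
    return list(candidates)[::stride]


all_scales = [2.0**e for e in range(-8, 8)]  # 16 made-up scale candidates
```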
The dynamic block check previously used attribute access and failed for dict-backed `block_sizes`, so dynamic block quantizers could incorrectly enter the MoE amax completeness/sync paths. The FP8 sweep stride keeps the default exhaustive behavior while giving recipes a controlled way to reduce NVFP4 weight scale search time.
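The failure mode can be sketched in isolation; the function names and config shape below are illustrative stand-ins, not the real ModelOpt API:

```python
# Minimal sketch of the dict-backed block_sizes bug described above.
def is_dynamic_block_buggy(block_sizes):
    # Attribute-style check: works for a config object exposing .type,
    # but silently returns False for a plain dict.
    return getattr(block_sizes, "type", None) == "dynamic"


def is_dynamic_block_fixed(block_sizes):
    # Handle both dict-backed and attribute-backed configurations.
    if isinstance(block_sizes, dict):
        return block_sizes.get("type") == "dynamic"
    return getattr(block_sizes, "type", None) == "dynamic"


dict_backed = {-1: 16, "type": "dynamic", "scale_bits": "e4m3"}
```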
Testing
```shell
python3 -m py_compile modelopt/torch/quantization/model_calib.py
git diff --check -- modelopt/torch/quantization/model_calib.py
```

Super recipe
Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:
rest: not quantized
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing
TODO test in HF and MCore PTQ on Nemotron model
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit
New Features
Improvements
Tests