Support Mixed precision & Static MSE PTQ in MCore export; Nemotron Super v3 NVFP4 recipe #1363

Open

jenchen13 wants to merge 15 commits into main from jennifchen/super_nvfp4_recipe
Conversation

@jenchen13
Contributor

@jenchen13 jenchen13 commented Apr 28, 2026

What does this PR do?

Type of change: New recipe

  • Add a YAML quantization recipe that roughly mirrors the Nemotron 3 Super NVFP4 hf_quant_config.json
  • Support mixed-precision export in MCore by detecting mixed-precision layers in the HF quant config
  • Support MCore-checkpoint-resumed TensorQuantizers (carrying a _global_amax field) in NVFP4QTensor static quantizer detection; fixes a bug during MCore export with MSE
  • Fix dynamic block quantizer detection when block_sizes is dict-backed
  • Skip dynamic block quantizers during MoE calibration completeness checks and distributed amax sync
  • Add fp8_scale_sweep_stride to optionally subsample NVFP4 FP8 scale sweep candidates

Why

The dynamic block check previously used attribute access and failed for dict-backed block_sizes, so dynamic block quantizers could incorrectly enter the MoE amax completeness/sync paths (a sketch of the dict-tolerant check follows below). The FP8 sweep stride keeps the default exhaustive behavior while giving recipes a controlled way to reduce NVFP4 weight-scale search time.
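
A minimal sketch of that dict-tolerant check, assuming the helper name from this PR's description (the actual code in model_calib.py may differ):

def _is_dynamic_block_quantizer(quantizer) -> bool:
    # Sketch only: handle both attribute-backed and dict-backed block_sizes.
    block_sizes = getattr(quantizer, "block_sizes", None)
    if block_sizes is None:
        return False
    if isinstance(block_sizes, dict):
        # Dict-backed config: attribute access on it would miss "type".
        return block_sizes.get("type") == "dynamic"
    return getattr(block_sizes, "type", None) == "dynamic"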

Testing

  • python3 -m py_compile modelopt/torch/quantization/model_calib.py
  • git diff --check -- modelopt/torch/quantization/model_calib.py

Super recipe

Mirrors the published nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 hf_quant_config.json:

  • MoE routed experts (mixer.experts.*.{up,down}_proj): NVFP4 W4A4, weight-MSE calibration, group_size 16
  • MoE shared experts (mixer.shared_experts.{up,down}_proj): FP8 per-tensor
  • Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
  • KV cache: FP8
  • Everything else: not quantized (the config-dict sketch below illustrates this mapping)
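
In ModelOpt config-dict form, the mapping above might look like the following sketch. It is illustrative only: the actual recipe is YAML, and the wildcard patterns, the FP8 per-tensor form, and the algorithm spelling are assumptions inferred from the layer names listed above.

# Illustrative sketch of the recipe's mapping; not the recipe itself.
SUPER_NVFP4_SKETCH = {
    "quant_cfg": {
        # Routed MoE experts: NVFP4 W4A4, group_size 16, FP8 (e4m3) scales.
        "*mixer.experts.*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "static", "scale_bits": (4, 3)},
        },
        "*mixer.experts.*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
        },
        # Shared experts and Mamba in/out_proj: FP8 per-tensor.
        "*mixer.shared_experts.*": {"num_bits": (4, 3), "axis": None},
        "*mixer.in_proj*": {"num_bits": (4, 3), "axis": None},
        "*mixer.out_proj*": {"num_bits": (4, 3), "axis": None},
        # FP8 KV cache via the BMM quantizers.
        "*[kv]_bmm_quantizer": {"num_bits": (4, 3), "axis": None},
        # Everything else stays unquantized (BF16).
        "default": {"enable": False},
    },
    # Weight-MSE calibration; the exact algorithm key/spelling may differ.
    "algorithm": {"method": "mse"},
}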

Usage

Below is a hypothetical usage sketch; this PR does not pin down the recipe-loading entry point, so the YAML-to-config translation step is an assumption.
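
# Hypothetical usage sketch. mtq.quantize(model, config, forward_loop) is
# ModelOpt's standard PTQ entry point; `build_quant_config` is a made-up
# helper standing in for however the YAML recipe is turned into a config
# dict, and `model` / `calib_dataloader` are assumed to be user-provided.
import yaml

import modelopt.torch.quantization as mtq

with open(
    "modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml"
) as f:
    recipe = yaml.safe_load(f)  # metadata + per-quantizer entries

def forward_loop(model):
    # Calibration loop: feed a few hundred representative batches.
    for batch in calib_dataloader:
        model(**batch)

config = build_quant_config(recipe)  # hypothetical translation helper
model = mtq.quantize(model, config, forward_loop)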

Testing

TODO: test HF and MCore PTQ on a Nemotron model

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Added three new quantization recipes for Nemotron-3-Super-120B with NVFP4 and FP8 calibration strategies.
    • Added configurable FP8 scale sweep stride parameter for fine-tuning quantization calibration.
    • Improved per-layer quantization metadata collection during model export.
  • Improvements

    • Enhanced mixed-precision quantization and MoE expert handling across distributed processes.
    • Refined KV-cache and attention layer quantization configuration export.
  • Tests

    • Updated quantization export verification tests.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from a team as a code owner April 28, 2026 18:33
@jenchen13 jenchen13 requested a review from h-guo18 April 28, 2026 18:33
@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR adds FP8 scale sweep stride control to calibration workflows, introduces three mixed-precision NVFP4 quantization recipes for Nemotron-3-Super-120B with different calibration methods (MSE, MSE with stride-4 sweep, max-based), refactors MoE calibration completeness checks to recursively traverse SequentialQuantizer leaves, and overhauls HuggingFace export to collect and apply per-layer quantization metadata.

Changes

  • YAML Quantization Recipes (modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml, super-nvfp4-fp8-sweep-stride4.yaml, super-nvfp4-max-calib.yaml): Three new mixed-precision quantization recipes: MSE calibration with FP8 scale sweep (stride 1 and 4), and max-based calibration. Each selectively enables NVFP4 W4A4 for routed MoE experts, FP8 for shared experts/Mamba components, and KV-cache BMM quantizers.
  • FP8 Scale Sweep Stride Configuration (modelopt/torch/quantization/config.py, modelopt/torch/quantization/calib/mse.py, modelopt/torch/quantization/model_calib.py): Adds the fp8_scale_sweep_stride parameter to MseCalibConfig, NVFP4MSECalibrator, and the mse_calibrate() function to subsample FP8 E4M3 scale candidates during calibration (minimum value 1).
  • MoE Calibration and Synchronization (modelopt/torch/quantization/model_calib.py): Refactors MoE completeness checks and DP/EP _amax synchronization to recursively traverse SequentialQuantizer leaves, consistently detect dynamic block quantizers via a shared helper, and skip dynamic-block leaves during synchronization.
  • NVFP4 Static Quantizer Detection (modelopt/torch/quantization/qtensor/nvfp4_tensor.py): Introduces a _get_static_global_amax helper supporting both dict- and attribute-style access to weight_quantizer.global_amax, replacing direct attribute lookups across scaling-factor methods; a sketch follows this list.
  • Megatron Plugin Sharding (modelopt/torch/quantization/plugins/megatron.py): Filters weight_quantizer._global_amax entries from the sharded axis mapping to exclude only global-amax tensors from axis assignment.
  • HF Export Layer Metadata (modelopt/torch/export/unified_export_megatron.py): Adds per-layer quantization metadata collection during export, refactors exclusion handling via a centralized _record_excluded_module helper, enhances QKV handling for unquantized cases with separate projection exclusions, and updates save_pretrained to apply the per-layer quantization config when available.
  • Export Test Updates (tests/gpu_megatron/torch/export/test_unified_export_megatron.py): Revises quantization config verification to require whole-structure equality, updates HF config validation paths (e.g., config_groups/group_0/weights/group_size for the NVFP4 group size), and adds a unit test for unquantized QKV slicing and derived projection module exclusions.
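
As a rough sketch of the dict/attribute-tolerant lookup described under NVFP4 Static Quantizer Detection above (helper names follow the walkthrough; the field spelling and return handling in nvfp4_tensor.py may differ):

def _get_static_global_amax(weight_quantizer):
    # Sketch only: tolerate both dict- and attribute-style access to the
    # recorded global amax, as the walkthrough describes.
    if isinstance(weight_quantizer, dict):
        return weight_quantizer.get("global_amax")
    return getattr(weight_quantizer, "global_amax", None)

def _is_static_quantizer(weight_quantizer) -> bool:
    # Static (pre-calibrated) quantizers carry a recorded global amax.
    return _get_static_global_amax(weight_quantizer) is not None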

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.68%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately captures the main focus: mixed-precision support, static MSE PTQ in MCore export, and a new Nemotron Super v3 NVFP4 recipe, all of which are reflected throughout the changeset.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. The pull request does not introduce any critical security anti-patterns defined in SECURITY.md.


Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 29-32: The metadata claims attention o_proj is FP8 per-tensor but
quant_cfg lacks any override for attention.o_proj; either update the description
or add the missing quantizer mapping. Fix by adding an explicit quant_cfg
override for the attention o_proj parameter name (e.g., attention.o_proj /
attention.output_projection / whatever key is used in your model mapping) to use
the FP8 per-tensor quantizer used elsewhere, or remove the attention o_proj
mention from the metadata so it matches the existing quant_cfg; ensure you
reference the exact layer key used in quant_cfg to keep mapping consistent with
the model.
- Around line 94-115: The entries using broad quantizer_name patterns
('*mixer.fc1_latent_proj*weight_quantizer',
'*mixer.fc1_latent_proj*input_quantizer',
'*mixer.fc2_latent_proj*weight_quantizer',
'*mixer.fc2_latent_proj*input_quantizer') are enabling FP8 for every latent
projection instead of only layers 1, 3, and 5; update these quantizer_name
values to target only the specific layer instances (e.g. include the layer
index/identifier for layers 1, 3, and 5 in the wildcard or use a regex/explicit
list) so only those mixer.fc1_latent_proj and mixer.fc2_latent_proj quantizers
are set to num_bits: e4m3, and leave all other latent projection quantizers at
BF16 (or remove the generic entries).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 60ed9c0b-efef-4967-b321-7270a8853455

📥 Commits

Reviewing files that changed from the base of the PR and between 8eec6d4 and 2a0c852.

📒 Files selected for processing (1)
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml

@jenchen13 jenchen13 requested a review from realAsma April 28, 2026 18:40
@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 30.00000% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.90%. Comparing base (8eec6d4) to head (5de5541).
⚠️ Report is 10 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/unified_export_megatron.py 5.66% 50 Missing ⚠️
modelopt/torch/quantization/model_calib.py 61.53% 10 Missing ⚠️
modelopt/torch/quantization/calib/mse.py 0.00% 6 Missing ⚠️
modelopt/torch/quantization/plugins/megatron.py 0.00% 2 Missing ⚠️
...odelopt/torch/quantization/qtensor/nvfp4_tensor.py 83.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #1363       +/-   ##
===========================================
- Coverage   76.93%   58.90%   -18.03%     
===========================================
  Files         471      471               
  Lines       50404    53013     +2609     
===========================================
- Hits        38776    31227     -7549     
- Misses      11628    21786    +10158     
Flag Coverage Δ
examples 36.95% <27.83%> (-3.72%) ⬇️
gpu 15.78% <19.58%> (-44.38%) ⬇️
regression 14.91% <9.27%> (+0.20%) ⬆️
unit 52.80% <19.38%> (+0.06%) ⬆️


@jenchen13 jenchen13 changed the title from "Add nemotron Super v3 NVFP4 PTQ recipe" to "Add Nemotron Super v3 NVFP4 PTQ recipe" on Apr 28, 2026
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from a team as a code owner April 29, 2026 19:11
@jenchen13 jenchen13 changed the title from "Add Nemotron Super v3 NVFP4 PTQ recipe" to "Fix dynamic block quantizer detection & MSE MOE calibration; Add Nemotron Super v3 NVFP4 PTQ recipe" on Apr 29, 2026
@jenchen13 jenchen13 requested review from Fridah-nv and meenchen April 29, 2026 19:15
@github-actions
Contributor

github-actions Bot commented Apr 29, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1363/

Built to branch gh-pages at 2026-05-01 21:08 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/quantization/model_calib.py (1)

389-412: ⚠️ Potential issue | 🟡 Minor

Custom FP8-sweep backends still ignore the new stride setting.

When a registered backend_factory is used, this branch still calls the old 3-argument factory signature, so fp8_scale_sweep_stride only takes effect on the built-in NVFP4MSECalibrator path below. That makes the new config silently no-op for registry-backed sweep calibrators. Please extend the factory contract or reject non-default stride explicitly in the backend path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/model_calib.py` around lines 389 - 412, The
registered backend factory path (lookup via _FP8_SWEEP_CALIBRATOR_REGISTRY and
assigned to backend_factory) currently calls the factory with the old 3-argument
signature and thus ignores fp8_scale_sweep_stride; update the branch that sets
module._calibrator via backend_factory to either (A) call the factory with the
new argument (pass fp8_scale_sweep_stride) and update the factory contract
accordingly, or (B) explicitly detect a non-default fp8_scale_sweep_stride and
raise/rollback with a clear error so users know registry-backed calibrators do
not support stride; ensure the call still passes initial_amax,
module._calibrator._axis, partial(_mse_quant_func, quantizer=module) and include
fp8_scale_sweep_stride when choosing option A, mirroring how NVFP4MSECalibrator
is constructed.
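
For illustration, a sketch of option (B) from the finding above: fail loudly when a registry-backed factory cannot honor the stride. The helper name and parameters here are hypothetical; only the 3-argument factory call shape comes from the prompt text.

from functools import partial

def make_backend_calibrator(
    backend_factory, module, initial_amax, mse_quant_func, fp8_scale_sweep_stride=1
):
    # Reject a non-default stride instead of silently ignoring it on the
    # registry-backed path. All parameter names are assumptions.
    if fp8_scale_sweep_stride != 1:
        raise ValueError(
            "fp8_scale_sweep_stride is only honored by the built-in "
            "NVFP4MSECalibrator; registry-backed sweep calibrators use the "
            "legacy 3-argument factory signature."
        )
    # Legacy 3-argument contract, as described in the prompt above.
    return backend_factory(
        initial_amax,
        module._calibrator._axis,
        partial(mse_quant_func, quantizer=module),
    )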
♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)

41-42: ⚠️ Potential issue | 🟡 Minor

The metadata description still overstates what this recipe quantizes.

These lines say attention o_proj, fc1_latent_proj, and fc2_latent_proj are FP8 per-tensor, but there are no matching overrides in quant_cfg, and the header comments say the latent MoE projections stay BF16. Please update the description so it matches the actual recipe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around
lines 41 - 42, The metadata description in the YAML (the description field)
incorrectly claims that "attention o_proj", "fc1_latent_proj", and
"fc2_latent_proj" are FP8 per-tensor while the recipe does not contain matching
overrides in quant_cfg and header comments state latent MoE projections remain
BF16/FP16; update the description text to accurately reflect the recipe by
removing or changing the FP8 claims (e.g., state that latent MoE projections and
those specific projections remain BF16/FP16 and only list the layers that the
quant_cfg actually overrides as FP8 per-tensor).
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)

34-35: ⚠️ Potential issue | 🟡 Minor

The metadata description does not match the actual quantizer mapping.

These lines still claim attention o_proj, fc1_latent_proj, and fc2_latent_proj are FP8 per-tensor, but this recipe never enables those quantizers, and the header comments above say latent MoE stays BF16. Please align the description with the quant_cfg that follows.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`
around lines 34 - 35, The YAML description string claims that "attention o_proj,
fc1_latent_proj, and fc2_latent_proj" are FP8 per-tensor, but the quant_cfg in
this recipe does not enable quantizers for "attention.o_proj",
"fc1_latent_proj", or "fc2_latent_proj" (and the header notes latent MoE stays
BF16/FP16); fix this by either updating the description to state those
projections remain BF16/FP16 (and remove the FP8 per-tensor claim) or modify
quant_cfg to actually enable FP8 per-tensor quantizers for the keys
"attention.o_proj", "fc1_latent_proj", and "fc2_latent_proj" so the comment
matches the mapping.
🧹 Nitpick comments (1)
modelopt/torch/quantization/calib/mse.py (1)

202-206: Add a regression test for strided FP8 candidate generation.

The existing coverage only exercises the default 126-candidate path. This branch adds two new behaviors—subsampling and forced inclusion of the last candidate—so it should have a focused test for fp8_scale_sweep_stride > 1 to lock down both the reduced candidate count and preservation of the max scale.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/calib/mse.py` around lines 202 - 206, The strided
FP8 candidate-generation branch guarded by fp8_scale_sweep_stride > 1 (in the
block that subsamples fp8_values into candidates and appends the last value) is
untested; add a focused regression test (e.g.,
test_fp8_scale_sweep_stride_preserves_last_candidate) that sets
fp8_scale_sweep_stride > 1, calls the code path that produces fp8_values, and
asserts that the resulting candidates length is reduced according to the stride
and that the final element equals the original fp8_values[-1] (verifying forced
inclusion of the max scale); ensure the test covers both subsampling and the
append behavior so the branch is locked down.
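
To make the suggested test concrete, here is a sketch using a stand-in helper for the stride logic, since the real code path lives inside NVFP4MSECalibrator; the 126-candidate count comes from the comment above, and the helper itself is an assumption.

import torch

def subsample_fp8_candidates(fp8_values, stride):
    # Stand-in for the presumed stride logic: keep every `stride`-th
    # candidate and force-include the last (max-scale) candidate.
    kept = fp8_values[::stride]
    if not torch.equal(kept[-1], fp8_values[-1]):
        kept = torch.cat([kept, fp8_values[-1:]])
    return kept

def test_fp8_scale_sweep_stride_preserves_last_candidate():
    fp8_values = torch.linspace(0.5, 448.0, 126)  # 126 FP8 E4M3 scale candidates
    strided = subsample_fp8_candidates(fp8_values, stride=4)
    assert len(strided) < len(fp8_values)  # subsampling reduced the sweep
    assert strided[-1] == fp8_values[-1]   # max scale still present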
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 16-28: Remove the unresolved Git merge conflict markers (<<<<<<<,
=======, >>>>>>>) and restore a single coherent comment block describing the
quant config; keep the more detailed version that lists both HF and
Megatron-Core names (the lines mentioning mixer.experts.<N>.{up,down}_proj,
mlp.experts.local_experts.<N>.linear_fc{1,2},
mixer.shared_experts.{up,down}_proj, and mlp.shared_experts.linear_fc{1,2}) or
merge its additional details into the shorter variant so the YAML comment is
valid and free of conflict markers.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0d72ce13-180e-46ea-b4d8-8c6c140d22a7

📥 Commits

Reviewing files that changed from the base of the PR and between 9282cdb and f796197.

📒 Files selected for processing (5)
  • modelopt/torch/quantization/calib/mse.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (2)
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml (1)

35-36: ⚠️ Potential issue | 🟡 Minor

Metadata precision mapping is inconsistent with the actual recipe.

Line 35 says latent MoE fc1_latent_proj/fc2_latent_proj are FP8 per-tensor, but this file has no latent-MoE quantizer overrides and the header (Line 27) says they stay BF16. Please align the description with quant_cfg to avoid confusion.

Proposed patch
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj
-    FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with FP8 scale sweep.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml` around
lines 35 - 36, The description's claim that latent MoE
"fc1_latent_proj/fc2_latent_proj" are FP8 per-tensor is inconsistent with the
quant_cfg (and header) which leave those layers as BF16; either update the
description string to state that fc1_latent_proj and fc2_latent_proj remain
BF16/FP16 to match quant_cfg, or add explicit quantizer overrides in quant_cfg
for the fc1_latent_proj and fc2_latent_proj modules to set them to FP8
per-tensor (matching the rest of the FP8 settings); ensure the description text
and the quant_cfg entries (module names fc1_latent_proj, fc2_latent_proj) are
aligned.
modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml (1)

34-35: ⚠️ Potential issue | 🟡 Minor

Description/comments conflict on latent MoE and SSM/mamba precision.

Lines 34-35 state latent MoE is FP8 per-tensor, but the recipe does not enable latent-MoE quantizers. Also, Lines 134-135 conflict with earlier comments on SSM/mamba precision. Please make these statements internally consistent.

Proposed patch
 metadata:
   recipe_type: ptq
-  description: Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16); shared experts, mamba in/out_proj, and Latent MOE fc1_latent_proj/fc2_latent_proj
-    FP8 per-tensor; FP8 KV cache; lm_head/MTP/SSM stay BF16/FP16. Weight-MSE calibration with stride-4 FP8 scale sweep.
+  description: >-
+    Super NVFP4 mixed precision — sparse MoE experts NVFP4 (W4A4, group_size 16);
+    shared experts and mamba in/out_proj FP8 per-tensor; FP8 KV cache; latent MoE,
+    lm_head, MTP, output, and mamba conv1d stay BF16; SSM cache stays FP32
+    (optionally FP16 in vLLM). Weight-MSE calibration with stride-4 FP8 scale sweep.
@@
-    # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head.
-    # SSM state / mamba conv1d stay FP16.
+    # Stay BF16: lm_head, output projection, MoE routers/gates, MTP head, mamba conv1d.
+    # SSM state stays FP32 (can be set to FP16 in vLLM).

Also applies to: 134-135

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`
around lines 34 - 35, The description text claims "latent MoE is FP8 per-tensor"
and that "lm_head/MTP/SSM stay BF16/FP16" but the recipe doesn't enable
latent-MoE quantizers and later lines conflict on SSM/mamba precision; fix by
making the YAML fields match the prose: either add the latent-MoE quantizer key
(e.g., include latent_moe in the quantizers list or set
enable_latent_moe_quantizers: true) and ensure its precision is set to FP8
per-tensor, or change the description to remove the FP8 latent-MoE claim;
likewise reconcile SSM/mamba entries by updating the lm_head/MTP/SSM and mamba
precision fields to uniformly state BF16/FP16 (or change the description to
reflect the actual configured precisions) so the "description" string and the
quantizer/precision keys (latent_moe_quantizers, quantizers list, ssm_precision,
mamba_precision, lm_head_precision / mtp_precision) are consistent.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 595b4531-3ced-4894-a5cc-c28f161c8f3e

📥 Commits

Reviewing files that changed from the base of the PR and between f796197 and 81d9d87.

📒 Files selected for processing (2)
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
# Megatron-Core names: mlp.shared_experts.linear_fc{1,2}
# - Mamba mixer linears (mixer.{in,out}_proj): FP8 per-tensor
# - KV cache: FP8
# - Attention linears ({q,k,v}_proj): BF16 (not quantized)


Can we double check attention out linear? IIRC, attention o_proj should be FP8.

Contributor Author


Responded in Slack: only 2/9 attention layers had o_proj FP8 in the final Super NVFP4 checkpoint, but we can always add it later to test whether the accuracy degradation is minimal.

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested a review from a team as a code owner May 1, 2026 19:29
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13 jenchen13 changed the title from "Fix dynamic block quantizer detection & MSE MOE calibration; Add Nemotron Super v3 NVFP4 PTQ recipe" to "Support Mixed precision & Static MSE PTQ in MCore export; Add Nemotron Super v3 NVFP4 PTQ recipe" on May 1, 2026
@jenchen13 jenchen13 changed the title from "Support Mixed precision & Static MSE PTQ in MCore export; Add Nemotron Super v3 NVFP4 PTQ recipe" to "Support Mixed precision & Static MSE PTQ in MCore export; Nemotron Super v3 NVFP4 recipe" on May 1, 2026
@jenchen13
Contributor Author

/claude review

@jenchen13
Contributor Author

@coderabbitai full review

@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

✅ Actions performed

Full review triggered.

@jenchen13
Contributor Author

@claude review

Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/export/unified_export_megatron.py (1)

1213-1229: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Record packed-expert quant metadata against the module prefix, not the weight key.

In both pack-remap helpers, prefix is the tensor key written into state_dict (for example ...weight). Recording that verbatim produces layer_config_dict entries like ...weight.quantization, and process_layer_quant_config() will then emit quantized_layers names ending in .weight instead of the HF module prefix. Serving-side layer matching will miss those packed experts.

Suggested fix
-            self._record_layer_quant_config(prefix, qformat, block_size)
+            module_prefix = prefix.rsplit(".", 1)[0] + "."
+            self._record_layer_quant_config(module_prefix, qformat, block_size)

Apply the same normalization in both _pack_name_remapping() and _pack_name_remapping_gpt_oss().

Also applies to: 1280-1298

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/unified_export_megatron.py` around lines 1213 - 1229,
The code records quant metadata under the full tensor key (e.g., "...weight")
causing layer names to end with ".weight"; update both _pack_name_remapping and
_pack_name_remapping_gpt_oss to normalize the prefix before calling
self._record_layer_quant_config by stripping the tensor suffix (e.g., remove
trailing ".weight" or the last dot component) so the module-level HF prefix is
recorded instead of the weight key; apply the same normalization logic where
self._record_layer_quant_config(prefix, qformat, block_size) is invoked in both
functions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml`:
- Around line 16-36: The recipe metadata/comments misstate Latent MOE behavior:
fc1_latent_proj/fc2_latent_proj are described as FP8 per-tensor but the
quant_cfg (and lines noting "stay BF16/FP16") never enable those quantizers;
either update the human-readable description to say Latent MOE projections
remain BF16 (or FP16 per existing comment) to match quant_cfg, or enable FP8
per-tensor quantizers for fc1_latent_proj and fc2_latent_proj in the quant_cfg
so the comment matches behavior—make the change by editing the
metadata/description block and/or the quant_cfg entries that reference
fc1_latent_proj and fc2_latent_proj accordingly.

In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml`:
- Around line 34-35: The description erroneously claims the Latent MoE is "FP8
per-tensor" while the recipe does not enable quantizer patterns for
fc1_latent_proj or fc2_latent_proj; update the metadata description string (the
YAML "description" field) to remove or correct the FP8 statement for Latent MoE
(or explicitly state that fc1_latent_proj/fc2_latent_proj remain BF16/FP16) so
it matches the actual quantizer configuration for fc1_latent_proj and
fc2_latent_proj.

In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`:
- Around line 35-36: The description claims latent MoE is FP8 and that static
scales are "chosen by MSE", but the YAML sets no quantizers for
fc1_latent_proj/fc2_latent_proj and uses method: max; update the human-readable
description to match the actual config by either (A) enabling the latent
projection quantizers (fc1_latent_proj/fc2_latent_proj) to make latent MoE FP8,
or (B) explicitly state latent projections remain unquantized (not FP8) and
change the phrase "static scales are chosen by MSE" to reflect the configured
method (e.g., "static scales computed by max" or "method: max"); adjust the
description lines referencing "FP8 per-tensor; ... Latent MOE
fc1_latent_proj/fc2_latent_proj" and the sentence about static scale selection
accordingly so the text and the fields (fc1_latent_proj, fc2_latent_proj, and
method: max) are consistent.

In `@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml`:
- Around line 35-36: The description string misreports the latent-MoE policy:
update the description (the YAML description field) to reflect that
fc1_latent_proj and fc2_latent_proj are left unquantized (per the quant_cfg)
rather than being "FP8 per-tensor"; edit the text around the FP8/KV/latent-MoE
sentence to explicitly state that fc1_latent_proj/fc2_latent_proj remain
unquantized (or their actual precision) and keep FP8 per-tensor/KV wording only
for the tensors that truly use FP8.

In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 815-818: The current check only treats qformat is None as
excluded, but QUANTIZATION_NONE should be treated the same so those layers are
recorded as excluded (and not dropped from the exported quant config); update
the conditional in unified_export_megatron.py where qformat is obtained from
_get_quantization_format(module) to also consider QUANTIZATION_NONE (e.g., if
qformat is None or qformat == QUANTIZATION_NONE) before calling
_record_excluded_module(prefix), and ensure any references to
_record_layer_quant_config still see these explicitly disabled quantizers as
excluded rather than omitted.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3adef0ee-1161-4068-a796-fc857c2455fd

📥 Commits

Reviewing files that changed from the base of the PR and between 8eec6d4 and 5e32bd1.

📒 Files selected for processing (11)
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/quantization/calib/mse.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/plugins/megatron.py
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-amax.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
  • tests/gpu_megatron/torch/export/test_unified_export_megatron.py

jenchen13 added 3 commits May 1, 2026 13:53
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
@jenchen13
Contributor Author

@CodeRabbit full review

@coderabbitai
Contributor

coderabbitai Bot commented May 1, 2026

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
modelopt/torch/quantization/qtensor/nvfp4_tensor.py (1)

120-122: 💤 Low value

_get_static_global_amax is called twice for the static branch.

_is_static_quantizer internally calls _get_static_global_amax and discards the result; line 122 then calls it again. Consider hoisting to a single variable:

♻️ Proposed refactor
-        if cls._is_static_quantizer(weight_quantizer):
-            # Static path: use pre-computed per-block amax values from quantizer
-            global_amax = cls._get_static_global_amax(weight_quantizer).float()
+        global_amax = cls._get_static_global_amax(weight_quantizer)
+        if global_amax is not None:
+            # Static path: use pre-computed per-block amax values from quantizer
+            global_amax = global_amax.float()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/quantization/qtensor/nvfp4_tensor.py` around lines 120 - 122,
Hoist the call to _get_static_global_amax so it's executed only once: call amax
= cls._get_static_global_amax(weight_quantizer) (and .float() as needed) and
reuse that value instead of calling _get_static_global_amax again; update
_is_static_quantizer to accept an optional precomputed amax parameter or
refactor _is_static_quantizer to avoid calling _get_static_global_amax
internally so the static branch uses the hoisted amax (refer to
_is_static_quantizer, _get_static_global_amax and weight_quantizer).
tests/gpu_megatron/torch/export/test_unified_export_megatron.py (1)

151-164: ⚡ Quick win

Add one end-to-end mixed-precision export case.

The new export path here is the per-layer layer_config_dict/process_layer_quant_config flow, but this matrix still only exercises uniform configs (NVFP4_DEFAULT_CFG / FP8_DEFAULT_CFG). A single mixed-recipe case would catch regressions in quantized_layers, excludes, and the config.json ↔ hf_quant_config.json parity you just tightened.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gpu_megatron/torch/export/test_unified_export_megatron.py` around lines
151 - 164, Add a new parametrized test case to the existing
pytest.mark.parametrize matrix (the tuple of ("model_type", "arch",
"extra_module", "quant_config", "kv_cache_quant_cfg")) to exercise the per-layer
mixed-precision export path: supply a quant_config that triggers the
layer_config_dict / process_layer_quant_config flow (i.e., a mixed-recipe config
rather than NVFP4_DEFAULT_CFG or FP8_DEFAULT_CFG) so the test exercises
quantized_layers, excludes handling, and the config.json ↔ hf_quant_config.json
parity; update the matrix alongside existing entries (referencing the parameters
model_type/arch/extra_module/quant_config/kv_cache_quant_cfg) so one case uses a
mixed per-layer config for either nemotron or llama.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 94f3528f-2ee4-4757-b8de-3a323300ca74

📥 Commits

Reviewing files that changed from the base of the PR and between 8eec6d4 and 5de5541.

📒 Files selected for processing (10)
  • modelopt/torch/export/unified_export_megatron.py
  • modelopt/torch/quantization/calib/mse.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/plugins/megatron.py
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-fp8-sweep-stride4.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml
  • modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4.yaml
  • tests/gpu_megatron/torch/export/test_unified_export_megatron.py

Comment on lines +45 to +79
    # MoE routed experts -> NVFP4 W4A4, block_size 16, e4m3 scale.
    # HF/export names: backbone.layers.*.mixer.experts.*.{up,down}_proj.
    - quantizer_name: '*mixer.experts.*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    - quantizer_name: '*mixer.experts.*input_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    # Megatron-Core/PTQ names: decoder.layers.*.mlp.experts.local_experts.*.linear_fc{1,2}.
    - quantizer_name: '*mlp.experts*weight_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
    - quantizer_name: '*mlp.experts*input_quantizer'
      enable: true
      cfg:
        block_sizes:
          -1: 16
          type: dynamic
          scale_bits: e4m3
        num_bits: e2m1
Contributor


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep routed-expert weights static in the max-calib variant.

These two *...weight_quantizer entries are type: dynamic, so this recipe does more than swap method: mse for method: max—it changes routed-expert weights to dynamic NVFP4 and bypasses the static-weight calibration path entirely. That makes this variant a different quantization recipe, not a max-calibrated version of super-nvfp4.yaml.

Suggested fix
     - quantizer_name: '*mixer.experts.*weight_quantizer'
       enable: true
       cfg:
         block_sizes:
-          type: dynamic
+          type: static
           scale_bits: e4m3
         num_bits: e2m1
@@
     - quantizer_name: '*mlp.experts*weight_quantizer'
       enable: true
       cfg:
         block_sizes:
-          type: dynamic
+          type: static
           scale_bits: e4m3
         num_bits: e2m1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@modelopt_recipes/models/Nemotron-3-Super-120B-A12B/super-nvfp4-max-calib.yaml`
around lines 45 - 79, The two routed-expert weight quantizer entries
('*mixer.experts.*weight_quantizer' and '*mlp.experts*weight_quantizer')
currently use type: dynamic; change them to use the static-weight calibration
path by replacing type: dynamic with type: static (or remove the dynamic setting
and ensure static is explicitly set) so routed-expert weights remain static in
this max-calib variant while leaving the corresponding input_quantizer entries
unchanged.

if hasattr(self, "kv_cache_dtype"):
    self._hf_quant_config["quantization"]["kv_cache_quant_algo"] = self.kv_cache_dtype
# Use one serving-facing config for both hf_quant_config.json and config.json.
self._hf_quant_config = convert_hf_quant_config_format(raw_hf_quant_config)
Collaborator

Do we change the format of hf_quant_config.json with this change?
