
Add support matrix for Nemotron-3 #1023

Merged
meenchen merged 4 commits into main from weimingc/nemotron_super_matrix on Mar 11, 2026

Conversation

@meenchen
Contributor

@meenchen meenchen commented Mar 11, 2026

What does this PR do?

Type of change: documentation

Add support matrix for Nemotron-3

Usage

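A minimal usage sketch, assuming the `scripts/huggingface_example.sh` entry point from `examples/llm_ptq` and the flags shown in the README updated by this PR (the checkpoint placeholder is hypothetical):

export HF_PATH=<a Nemotron-3 checkpoint from the Hugging Face hub, or simply the model card>
scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75 --calib_batch_size 4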

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
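What those guidelines mean in practice, as a minimal sketch using standard torch/transformers calls (the file path and model ID are placeholders):

import torch
from transformers import AutoModelForCausalLM

# Discouraged: weights_only=False lets arbitrary pickled objects execute code on load.
# state = torch.load("checkpoint.pt", weights_only=False)

# Preferred: restrict deserialization to plain tensors and containers.
state = torch.load("checkpoint.pt", weights_only=True)

# Preferred: leave trust_remote_code off unless the repository's custom code has been reviewed.
model = AutoModelForCausalLM.from_pretrained("<model-id>")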

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Nemotron-3 model quantization support added; Mixture-of-Experts expert support retained in auto-quantize scoring/grouping
  • Documentation

    • Hugging Face model support matrix updated to include Nemotron-3 and adjust referenced models
    • Quantization examples revised: nvfp4_mse with fp8 and effective precision ~4.75; example configs clarified

@meenchen meenchen self-assigned this Mar 11, 2026
@meenchen meenchen requested a review from a team as a code owner March 11, 2026 15:35
@meenchen meenchen requested a review from realAsma March 11, 2026 15:35
@coderabbitai
Contributor

coderabbitai Bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds Nemotron-3 (NemotronHForCausalLM) quantization support to auto_quantize grouping and scoring rules and updates example documentation to reference Nemotron/Qwen models and new quantization formats and bit settings.
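The grouping idea is easy to illustrate outside the library. A standalone sketch (not the actual `auto_quantize` internals; the module names and `group_key` helper are hypothetical) that buckets per-expert linear layers so a quantization format is scored once per MoE block rather than once per expert:

import re
from collections import defaultdict

# Hypothetical module names in the style of a NemotronH MoE block.
modules = [
    "layers.0.mixer.experts.0.up_proj",
    "layers.0.mixer.experts.1.up_proj",
    "layers.0.mixer.experts.0.down_proj",
    "layers.0.mixer.experts.1.down_proj",
    "layers.1.self_attn.q_proj",
]

def group_key(name: str) -> str:
    # Collapse the expert index so all experts in a block share one group.
    return re.sub(r"\.experts\.\d+\.", ".experts.*.", name)

groups = defaultdict(list)
for name in modules:
    groups[group_key(name)].append(name)

for key, members in sorted(groups.items()):
    print(key, "->", members)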

Changes

Cohort / File(s) — Summary

Changelog Update — CHANGELOG.rst
Adds Nemotron-3 (NemotronHForCausalLM) to auto_quantize grouping and scoring rules; retains NemotronH MoE expert support. (1 line changed)

Documentation Examples — examples/llm_ptq/README.md
Updates the Hugging Face model support lists to include Nemotron-3, replaces Llama-3 references with Qwen/Nemotron, and adjusts the AutoQuantize example formats from w4a8_awq,fp8 (4.8 bits) to nvfp4_mse,fp8 (4.75 bits).
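For readers unfamiliar with the 4.8 and 4.75 figures: the `--auto_quantize_bits` constraint is a parameter-weighted average of per-layer bit widths. A back-of-the-envelope sketch (treating nvfp4 as 4 bits and fp8 as 8 bits, which ignores scale-factor overhead):

# Effective bits = sum(bits_i * params_i) / sum(params_i).
# With two formats, the fraction f of parameters in the low-bit format
# that meets the budget solves: f * low + (1 - f) * high = budget.
low_bits, high_bits, budget = 4.0, 8.0, 4.75
f = (high_bits - budget) / (high_bits - low_bits)
print(f)  # 0.8125 -> roughly 81% of weights in nvfp4, 19% in fp8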

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks: ✅ 4 passed

Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
Title Check — ✅ Passed — The title 'Add support matrix for Nemotron-3' directly aligns with the main changes: adding Nemotron-3 to support matrices in CHANGELOG.rst and examples/llm_ptq/README.md.
Docstring Coverage — ✅ Passed — No functions found in the changed files to evaluate docstring coverage; check skipped.
Security Anti-Patterns — ✅ Passed — The pull request contains only documentation changes to CHANGELOG.rst and examples/llm_ptq/README.md, with no new Python code modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch weimingc/nemotron_super_matrix

Comment @coderabbitai help to get the list of available commands and usage tips.

@meenchen meenchen requested review from Fridah-nv and cjluo-nv March 11, 2026 15:35
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)

253-257: ⚠️ Potential issue | 🟠 Major

Inconsistent documentation: Command example doesn't match its description.

The command at line 253 uses nvfp4_mse,fp8 with --auto_quantize_bits 4.75, but the description at lines 256-257 still references the old parameters (w4a8_awq and 4.8 bits). This creates confusion for readers trying to understand the example.

📝 Proposed fix for the description
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse,fp8`) and the more sensitive layers
+are quantized with `fp8` (or kept un-quantized for extremely sensitive layers) such that the effective bits is 4.75 (specified by `--auto_quantize_bits 4.75`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` around lines 253 - 257, The README example is
inconsistent: the command uses scripts/huggingface_example.sh with flags --quant
nvfp4_mse,fp8 and --auto_quantize_bits 4.75 but the description still mentions
w4a8_awq and 4.8 bits; update the paragraph describing AutoQuantize to match the
actual command by stating that less accuracy-sensitive layers are quantized with
nvfp4_mse,fp8 (as passed to --quant) while more sensitive layers are left
un-quantized, and change the effective bits reference from 4.8 to 4.75 (matching
--auto_quantize_bits 4.75).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 253-257: The README example is inconsistent: the command uses
scripts/huggingface_example.sh with flags --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 but the description still mentions w4a8_awq and 4.8
bits; update the paragraph describing AutoQuantize to match the actual command
by stating that less accuracy-sensitive layers are quantized with nvfp4_mse,fp8
(as passed to --quant) while more sensitive layers are left un-quantized, and
change the effective bits reference from 4.8 to 4.75 (matching
--auto_quantize_bits 4.75).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2050d605-5f02-4024-be2d-1b5775d29c97

📥 Commits

Reviewing files that changed from the base of the PR and between 26cad67 and baa8061f7f4eed0c5ce686c93f31bec1ec5c89b2.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • examples/llm_ptq/README.md

Contributor

@realAsma realAsma left a comment


Awesome!

@meenchen meenchen enabled auto-merge (squash) March 11, 2026 15:56
@meenchen meenchen disabled auto-merge March 11, 2026 15:57
@meenchen meenchen force-pushed the weimingc/nemotron_super_matrix branch 2 times, most recently from b4d5c51 to 0ff3a64, on March 11, 2026 16:02
@meenchen meenchen requested review from a team as code owners March 11, 2026 16:02
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)

244-257: ⚠️ Potential issue | 🟡 Minor

Sync the prose with the new AutoQuantize command.

Line 249 still references a LLaMA checkpoint, and Lines 256-257 still explain the old w4a8_awq / 4.8 example. That now contradicts the updated nvfp4_mse,fp8 command with --auto_quantize_bits 4.75, so the section is easy to misread.

Suggested doc fix
- export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
+ export HF_PATH=<the downloaded Qwen or Nemotron checkpoint from the Hugging Face hub, or simply the model card>
  # --auto_quantize_bits specifies the constraint for `AutoQuantize`
  # --quant specifies the formats to be searched for `AutoQuantize`
  # NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
  scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75 --calib_batch_size 4

-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` by searching across `nvfp4_mse` and `fp8`, targeting an effective quantized bit width of 4.75. Less sensitive layers can stay in the lower-bit format, while more sensitive layers can be assigned `fp8` to preserve accuracy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` around lines 244 - 257, Update the README prose
to match the new AutoQuantize invocation: replace the "LLaMA checkpoint" mention
with a generic Hugging Face model reference for HF_PATH (or "model card"), and
rewrite the explanatory paragraph after the script block to describe the current
flags used in scripts/huggingface_example.sh (i.e., --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 --calib_batch_size 4), removing the old w4a8_awq/4.8
example and instead explain that AutoQuantize will search nvfp4_mse and fp8
formats and target an average effective bit budget of 4.75 while keeping
sensitive layers at higher precision as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 244-257: Update the README prose to match the new AutoQuantize
invocation: replace the "LLaMA checkpoint" mention with a generic Hugging Face
model reference for HF_PATH (or "model card"), and rewrite the explanatory
paragraph after the script block to describe the current flags used in
scripts/huggingface_example.sh (i.e., --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 --calib_batch_size 4), removing the old w4a8_awq/4.8
example and instead explain that AutoQuantize will search nvfp4_mse and fp8
formats and target an average effective bit budget of 4.75 while keeping
sensitive layers at higher precision as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b7426913-10cc-47b9-b7ff-ff6d54c0a7cc

📥 Commits

Reviewing files that changed from the base of the PR and between 9857d36f743a4806b96fe4b69b2f711e105e44ad and b4d5c518d9166ba2888c653edc6ec328cd917591.

📒 Files selected for processing (1)
  • examples/llm_ptq/README.md

@kevalmorabia97
Collaborator

@meenchen the commits seem to have gotten messed up; the PR is showing 8k lines changed

@meenchen meenchen force-pushed the weimingc/nemotron_super_matrix branch from 0ff3a64 to baa8061 on March 11, 2026 16:09
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/nemotron_super_matrix branch from bceae22 to ab2fd34 on March 11, 2026 16:16
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
@meenchen
Contributor Author

@meenchen the commits seem to have gotten messed up; the PR is showing 8k lines changed

Fixed now

@meenchen meenchen enabled auto-merge (squash) March 11, 2026 16:17
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)

253-257: ⚠️ Potential issue | 🟠 Major

Update the explanation to match the command example.

The command on line 253 uses nvfp4_mse,fp8 with 4.75 bits, but the explanation on lines 256-257 still references the old values (w4a8_awq and 4.8 bits). This inconsistency will confuse users.

📝 Proposed fix to align the explanation
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse,fp8`) and the more sensitive layers
+are quantized with `fp8` or kept un-quantized such that the effective bits is 4.75 (specified by `--auto_quantize_bits 4.75`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` around lines 253 - 257, The README explanation is
inconsistent with the example command; update the explanatory text that
currently references `w4a8_awq` and `4.8` to match the command
`scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8
--auto_quantize_bits 4.75`, describing that less sensitive layers are quantized
with `nvfp4_mse,fp8` (as given by `--quant nvfp4_mse,fp8`) and the effective
bits are `4.75` (as given by `--auto_quantize_bits 4.75`) so the prose and the
flags `--quant` and `--auto_quantize_bits` in the README align.
🧹 Nitpick comments (1)
CHANGELOG.rst (1)

26-26: Simplify redundant phrasing.

The sentence contains redundant use of "support" three times. Consider rephrasing for clarity:

-- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and support for NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping and scoring rules.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.rst` at line 26, The sentence in the changelog is redundant;
rewrite it to remove repeated "support" while keeping meaning — for example,
change the line that mentions "Nemotron-3 (NemotronHForCausalLM) model
quantization and support for NemotronH MoE expert support in ``auto_quantize``
grouping and scoring rules" to a concise form that mentions Nemotron-3
quantization and NemotronH MoE expert handling in ``auto_quantize``
grouping/scoring (e.g., "Add Nemotron-3 (NemotronHForCausalLM) model
quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping and
scoring rules").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 253-257: The README explanation is inconsistent with the example
command; update the explanatory text that currently references `w4a8_awq` and
`4.8` to match the command `scripts/huggingface_example.sh --model $HF_PATH
--quant nvfp4_mse,fp8 --auto_quantize_bits 4.75`, describing that less sensitive
layers are quantized with `nvfp4_mse,fp8` (as given by `--quant nvfp4_mse,fp8`)
and the effective bits are `4.75` (as given by `--auto_quantize_bits 4.75`) so
the prose and the flags `--quant` and `--auto_quantize_bits` in the README
align.

---

Nitpick comments:
In `@CHANGELOG.rst`:
- Line 26: The sentence in the changelog is redundant; rewrite it to remove
repeated "support" while keeping meaning — for example, change the line that
mentions "Nemotron-3 (NemotronHForCausalLM) model quantization and support for
NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules" to
a concise form that mentions Nemotron-3 quantization and NemotronH MoE expert
handling in ``auto_quantize`` grouping/scoring (e.g., "Add Nemotron-3
(NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in
``auto_quantize`` grouping and scoring rules").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d4ed63b-c56a-4c2d-890b-205fbdfbe982

📥 Commits

Reviewing files that changed from the base of the PR and between b4d5c518d9166ba2888c653edc6ec328cd917591 and f168065.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • examples/llm_ptq/README.md

@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.09%. Comparing base (26cad67) to head (4d0d059).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1023   +/-   ##
=======================================
  Coverage   70.09%   70.09%           
=======================================
  Files         221      221           
  Lines       25459    25459           
=======================================
  Hits        17845    17845           
  Misses       7614     7614           

☔ View full report in Codecov by Sentry.


@meenchen meenchen disabled auto-merge March 11, 2026 16:45
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen enabled auto-merge (squash) March 11, 2026 16:47
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/llm_ptq/README.md (1)

256-256: Consider improving readability with hyphenation or restructuring.

The phrase "less quantization accuracy sensitive layers" would be clearer with hyphens as a compound modifier: "less quantization-accuracy-sensitive layers", or restructured as "layers that are less sensitive to quantization accuracy".

📝 Proposed improvement
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+The above example perform `AutoQuantize` where the less quantization-accuracy-sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers

Or alternatively:

-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+The above example perform `AutoQuantize` where layers that are less sensitive to quantization accuracy are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` at line 256, The sentence in the README example
describing AutoQuantize uses the awkward phrase "less quantization accuracy
sensitive layers"; update the wording around the AutoQuantize example
(mentioning --quant nvfp4_mse and AutoQuantize) to either hyphenate as "less
quantization-accuracy-sensitive layers" or rephrase to "layers that are less
sensitive to quantization accuracy" and adjust the following clause similarly
for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/llm_ptq/README.md`:
- Line 256: The sentence in the README example describing AutoQuantize uses the
awkward phrase "less quantization accuracy sensitive layers"; update the wording
around the AutoQuantize example (mentioning --quant nvfp4_mse and AutoQuantize)
to either hyphenate as "less quantization-accuracy-sensitive layers" or rephrase
to "layers that are less sensitive to quantization accuracy" and adjust the
following clause similarly for consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06d05b9a-645e-4899-bc52-31df2fa8efd3

📥 Commits

Reviewing files that changed from the base of the PR and between f168065 and de4c331.

📒 Files selected for processing (1)
  • examples/llm_ptq/README.md

Contributor

@Edwardf0t1 Edwardf0t1 left a comment


LGTM

@ChenhanYu
Collaborator

@meenchen, could you provide an update to our main README.md and perhaps some pointers regarding the checkpoints on the HF hub?

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/

@meenchen meenchen disabled auto-merge March 11, 2026 17:11
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen requested a review from a team as a code owner March 11, 2026 17:29
@meenchen
Contributor Author

@meenchen, could you provide an update to our main README.md and perhaps some pointers regarding the checkpoints on the HF hub?

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/

Updated the news.

@meenchen meenchen enabled auto-merge (squash) March 11, 2026 17:29
@meenchen meenchen merged commit 34a9fc7 into main Mar 11, 2026
38 checks passed
@meenchen meenchen deleted the weimingc/nemotron_super_matrix branch March 11, 2026 18:12