
Add support matrix for Nemotron-3 #1023

Merged
meenchen merged 4 commits into main from weimingc/nemotron_super_matrix on Mar 11, 2026

Conversation

@meenchen
Contributor

@meenchen meenchen commented Mar 11, 2026

What does this PR do?

Type of change: documentation

Add support matrix for Nemotron-3

Usage

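A minimal usage sketch, assuming the `scripts/huggingface_example.sh` entry point from `examples/llm_ptq` and the flags shown in the README updated by this PR (the checkpoint placeholder is hypothetical):

export HF_PATH=<a Nemotron-3 checkpoint from the Hugging Face hub, or simply the model card>
scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75 --calib_batch_size 4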

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
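What those guidelines mean in practice, as a minimal sketch using standard torch/transformers calls (the file path and model ID are placeholders):

import torch
from transformers import AutoModelForCausalLM

# Discouraged: weights_only=False lets arbitrary pickled objects execute code on load.
# state = torch.load("checkpoint.pt", weights_only=False)

# Preferred: restrict deserialization to plain tensors and containers.
state = torch.load("checkpoint.pt", weights_only=True)

# Preferred: leave trust_remote_code off unless the repository's custom code has been reviewed.
model = AutoModelForCausalLM.from_pretrained("<model-id>")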

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • Nemotron-3 model quantization support added; Mixture-of-Experts expert support retained in auto-quantize scoring/grouping
  • Documentation

    • Hugging Face model support matrix updated to include Nemotron-3 and adjust referenced models
    • Quantization examples revised: nvfp4_mse with fp8 and effective precision ~4.75; example configs clarified

@meenchen meenchen self-assigned this Mar 11, 2026
@meenchen meenchen requested a review from a team as a code owner March 11, 2026 15:35
@meenchen meenchen requested a review from realAsma March 11, 2026 15:35
@coderabbitai
Contributor

coderabbitai Bot commented Mar 11, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Adds Nemotron-3 (NemotronHForCausalLM) quantization support to auto_quantize grouping and scoring rules and updates example documentation to reference Nemotron/Qwen models and new quantization formats and bit settings.
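The grouping idea is easy to illustrate outside the library. A standalone sketch (not the actual `auto_quantize` internals; the module names and `group_key` helper are hypothetical) that buckets per-expert linear layers so a quantization format is scored once per MoE block rather than once per expert:

import re
from collections import defaultdict

# Hypothetical module names in the style of a NemotronH MoE block.
modules = [
    "layers.0.mixer.experts.0.up_proj",
    "layers.0.mixer.experts.1.up_proj",
    "layers.0.mixer.experts.0.down_proj",
    "layers.0.mixer.experts.1.down_proj",
    "layers.1.self_attn.q_proj",
]

def group_key(name: str) -> str:
    # Collapse the expert index so all experts in a block share one group.
    return re.sub(r"\.experts\.\d+\.", ".experts.*.", name)

groups = defaultdict(list)
for name in modules:
    groups[group_key(name)].append(name)

for key, members in sorted(groups.items()):
    print(key, "->", members)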

Changes

Cohort / File(s) — Summary

Changelog Update — CHANGELOG.rst
Adds Nemotron-3 (NemotronHForCausalLM) to auto_quantize grouping and scoring rules; retains NemotronH MoE expert support. (1 line changed)

Documentation Examples — examples/llm_ptq/README.md
Updates the Hugging Face model support lists to include Nemotron-3, replaces Llama-3 references with Qwen/Nemotron, and adjusts the AutoQuantize example formats from w4a8_awq,fp8 (4.8 bits) to nvfp4_mse,fp8 (4.75 bits).
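For readers unfamiliar with the 4.8 and 4.75 figures: the `--auto_quantize_bits` constraint is a parameter-weighted average of per-layer bit widths. A back-of-the-envelope sketch (treating nvfp4 as 4 bits and fp8 as 8 bits, which ignores scale-factor overhead):

# Effective bits = sum(bits_i * params_i) / sum(params_i).
# With two formats, the fraction f of parameters in the low-bit format
# that meets the budget solves: f * low + (1 - f) * high = budget.
low_bits, high_bits, budget = 4.0, 8.0, 4.75
f = (high_bits - budget) / (high_bits - low_bits)
print(f)  # 0.8125 -> roughly 81% of weights in nvfp4, 19% in fp8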

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks: ✅ 4 passed

Description Check — ✅ Passed — Check skipped: CodeRabbit’s high-level summary is enabled.
Title Check — ✅ Passed — The title 'Add support matrix for Nemotron-3' directly aligns with the main changes: adding Nemotron-3 to support matrices in CHANGELOG.rst and examples/llm_ptq/README.md.
Docstring Coverage — ✅ Passed — No functions found in the changed files to evaluate docstring coverage; check skipped.
Security Anti-Patterns — ✅ Passed — The pull request contains only documentation changes to CHANGELOG.rst and examples/llm_ptq/README.md, with no new Python code modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch weimingc/nemotron_super_matrix

Comment @coderabbitai help to get the list of available commands and usage tips.

@meenchen meenchen requested review from Fridah-nv and cjluo-nv March 11, 2026 15:35
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)

253-257: ⚠️ Potential issue | 🟠 Major

Inconsistent documentation: Command example doesn't match its description.

The command at line 253 uses nvfp4_mse,fp8 with --auto_quantize_bits 4.75, but the description at lines 256-257 still references the old parameters (w4a8_awq and 4.8 bits). This creates confusion for readers trying to understand the example.

📝 Proposed fix for the description
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse,fp8`) and the more sensitive layers
+are quantized with `fp8` (or kept un-quantized for extremely sensitive layers) such that the effective bits is 4.75 (specified by `--auto_quantize_bits 4.75`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` around lines 253 - 257, The README example is
inconsistent: the command uses scripts/huggingface_example.sh with flags --quant
nvfp4_mse,fp8 and --auto_quantize_bits 4.75 but the description still mentions
w4a8_awq and 4.8 bits; update the paragraph describing AutoQuantize to match the
actual command by stating that less accuracy-sensitive layers are quantized with
nvfp4_mse,fp8 (as passed to --quant) while more sensitive layers are left
un-quantized, and change the effective bits reference from 4.8 to 4.75 (matching
--auto_quantize_bits 4.75).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 253-257: The README example is inconsistent: the command uses
scripts/huggingface_example.sh with flags --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 but the description still mentions w4a8_awq and 4.8
bits; update the paragraph describing AutoQuantize to match the actual command
by stating that less accuracy-sensitive layers are quantized with nvfp4_mse,fp8
(as passed to --quant) while more sensitive layers are left un-quantized, and
change the effective bits reference from 4.8 to 4.75 (matching
--auto_quantize_bits 4.75).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2050d605-5f02-4024-be2d-1b5775d29c97

📥 Commits

Reviewing files that changed from the base of the PR and between 26cad67 and baa8061f7f4eed0c5ce686c93f31bec1ec5c89b2.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • examples/llm_ptq/README.md

Contributor

@realAsma realAsma left a comment


Awesome!

@meenchen meenchen enabled auto-merge (squash) March 11, 2026 15:56
@meenchen meenchen disabled auto-merge March 11, 2026 15:57
@meenchen meenchen force-pushed the weimingc/nemotron_super_matrix branch 2 times, most recently from b4d5c51 to 0ff3a64, on March 11, 2026 16:02
@meenchen meenchen requested review from a team as code owners March 11, 2026 16:02
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)

244-257: ⚠️ Potential issue | 🟡 Minor

Sync the prose with the new AutoQuantize command.

Line 249 still references a LLaMA checkpoint, and Lines 256-257 still explain the old w4a8_awq / 4.8 example. That now contradicts the updated nvfp4_mse,fp8 command with --auto_quantize_bits 4.75, so the section is easy to misread.

Suggested doc fix
- export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
+ export HF_PATH=<the downloaded Qwen or Nemotron checkpoint from the Hugging Face hub, or simply the model card>
  # --auto_quantize_bits specifies the constraint for `AutoQuantize`
  # --quant specifies the formats to be searched for `AutoQuantize`
  # NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
  scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75 --calib_batch_size 4

-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` by searching across `nvfp4_mse` and `fp8`, targeting an effective quantized bit width of 4.75. Less sensitive layers can stay in the lower-bit format, while more sensitive layers can be assigned `fp8` to preserve accuracy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` around lines 244 - 257, Update the README prose
to match the new AutoQuantize invocation: replace the "LLaMA checkpoint" mention
with a generic Hugging Face model reference for HF_PATH (or "model card"), and
rewrite the explanatory paragraph after the script block to describe the current
flags used in scripts/huggingface_example.sh (i.e., --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 --calib_batch_size 4), removing the old w4a8_awq/4.8
example and instead explain that AutoQuantize will search nvfp4_mse and fp8
formats and target an average effective bit budget of 4.75 while keeping
sensitive layers at higher precision as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 244-257: Update the README prose to match the new AutoQuantize
invocation: replace the "LLaMA checkpoint" mention with a generic Hugging Face
model reference for HF_PATH (or "model card"), and rewrite the explanatory
paragraph after the script block to describe the current flags used in
scripts/huggingface_example.sh (i.e., --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 --calib_batch_size 4), removing the old w4a8_awq/4.8
example and instead explain that AutoQuantize will search nvfp4_mse and fp8
formats and target an average effective bit budget of 4.75 while keeping
sensitive layers at higher precision as needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b7426913-10cc-47b9-b7ff-ff6d54c0a7cc

📥 Commits

Reviewing files that changed from the base of the PR and between 9857d36f743a4806b96fe4b69b2f711e105e44ad and b4d5c518d9166ba2888c653edc6ec328cd917591.

📒 Files selected for processing (1)
  • examples/llm_ptq/README.md

@kevalmorabia97
Collaborator

@meenchen the commits seem to have gotten messed up; the PR is showing 8k lines changed

@meenchen meenchen force-pushed the weimingc/nemotron_super_matrix branch from 0ff3a64 to baa8061 on March 11, 2026 16:09
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen force-pushed the weimingc/nemotron_super_matrix branch from bceae22 to ab2fd34 on March 11, 2026 16:16
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
@meenchen
Contributor Author

@meenchen the commits seem to have gotten messed up; the PR is showing 8k lines changed

Fixed now

@meenchen meenchen enabled auto-merge (squash) March 11, 2026 16:17
Contributor

@coderabbitai coderabbitai Bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)

253-257: ⚠️ Potential issue | 🟠 Major

Update the explanation to match the command example.

The command on line 253 uses nvfp4_mse,fp8 with 4.75 bits, but the explanation on lines 256-257 still references the old values (w4a8_awq and 4.8 bits). This inconsistency will confuse users.

📝 Proposed fix to align the explanation
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse,fp8`) and the more sensitive layers
+are quantized with `fp8` or kept un-quantized such that the effective bits is 4.75 (specified by `--auto_quantize_bits 4.75`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` around lines 253 - 257, The README explanation is
inconsistent with the example command; update the explanatory text that
currently references `w4a8_awq` and `4.8` to match the command
`scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8
--auto_quantize_bits 4.75`, describing that less sensitive layers are quantized
with `nvfp4_mse,fp8` (as given by `--quant nvfp4_mse,fp8`) and the effective
bits are `4.75` (as given by `--auto_quantize_bits 4.75`) so the prose and the
flags `--quant` and `--auto_quantize_bits` in the README align.
🧹 Nitpick comments (1)
CHANGELOG.rst (1)

26-26: Simplify redundant phrasing.

The sentence contains redundant use of "support" three times. Consider rephrasing for clarity:

-- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and support for NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping and scoring rules.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CHANGELOG.rst` at line 26, The sentence in the changelog is redundant;
rewrite it to remove repeated "support" while keeping meaning — for example,
change the line that mentions "Nemotron-3 (NemotronHForCausalLM) model
quantization and support for NemotronH MoE expert support in ``auto_quantize``
grouping and scoring rules" to a concise form that mentions Nemotron-3
quantization and NemotronH MoE expert handling in ``auto_quantize``
grouping/scoring (e.g., "Add Nemotron-3 (NemotronHForCausalLM) model
quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping and
scoring rules").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 253-257: The README explanation is inconsistent with the example
command; update the explanatory text that currently references `w4a8_awq` and
`4.8` to match the command `scripts/huggingface_example.sh --model $HF_PATH
--quant nvfp4_mse,fp8 --auto_quantize_bits 4.75`, describing that less sensitive
layers are quantized with `nvfp4_mse,fp8` (as given by `--quant nvfp4_mse,fp8`)
and the effective bits are `4.75` (as given by `--auto_quantize_bits 4.75`) so
the prose and the flags `--quant` and `--auto_quantize_bits` in the README
align.

---

Nitpick comments:
In `@CHANGELOG.rst`:
- Line 26: The sentence in the changelog is redundant; rewrite it to remove
repeated "support" while keeping meaning — for example, change the line that
mentions "Nemotron-3 (NemotronHForCausalLM) model quantization and support for
NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules" to
a concise form that mentions Nemotron-3 quantization and NemotronH MoE expert
handling in ``auto_quantize`` grouping/scoring (e.g., "Add Nemotron-3
(NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in
``auto_quantize`` grouping and scoring rules").

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d4ed63b-c56a-4c2d-890b-205fbdfbe982

📥 Commits

Reviewing files that changed from the base of the PR and between b4d5c518d9166ba2888c653edc6ec328cd917591 and f168065.

📒 Files selected for processing (2)
  • CHANGELOG.rst
  • examples/llm_ptq/README.md

@codecov

codecov Bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.09%. Comparing base (26cad67) to head (4d0d059).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1023   +/-   ##
=======================================
  Coverage   70.09%   70.09%           
=======================================
  Files         221      221           
  Lines       25459    25459           
=======================================
  Hits        17845    17845           
  Misses       7614     7614           

☔ View full report in Codecov by Sentry.


@meenchen meenchen disabled auto-merge March 11, 2026 16:45
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen enabled auto-merge (squash) March 11, 2026 16:47
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/llm_ptq/README.md (1)

256-256: Consider improving readability with hyphenation or restructuring.

The phrase "less quantization accuracy sensitive layers" would be clearer with hyphens as a compound modifier: "less quantization-accuracy-sensitive layers", or restructured as "layers that are less sensitive to quantization accuracy".

📝 Proposed improvement
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+The above example perform `AutoQuantize` where the less quantization-accuracy-sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers

Or alternatively:

-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+The above example perform `AutoQuantize` where layers that are less sensitive to quantization accuracy are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/README.md` at line 256, The sentence in the README example
describing AutoQuantize uses the awkward phrase "less quantization accuracy
sensitive layers"; update the wording around the AutoQuantize example
(mentioning --quant nvfp4_mse and AutoQuantize) to either hyphenate as "less
quantization-accuracy-sensitive layers" or rephrase to "layers that are less
sensitive to quantization accuracy" and adjust the following clause similarly
for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@examples/llm_ptq/README.md`:
- Line 256: The sentence in the README example describing AutoQuantize uses the
awkward phrase "less quantization accuracy sensitive layers"; update the wording
around the AutoQuantize example (mentioning --quant nvfp4_mse and AutoQuantize)
to either hyphenate as "less quantization-accuracy-sensitive layers" or rephrase
to "layers that are less sensitive to quantization accuracy" and adjust the
following clause similarly for consistency.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 06d05b9a-645e-4899-bc52-31df2fa8efd3

📥 Commits

Reviewing files that changed from the base of the PR and between f168065 and de4c331.

📒 Files selected for processing (1)
  • examples/llm_ptq/README.md

Contributor

@Edwardf0t1 Edwardf0t1 left a comment


LGTM

@ChenhanYu
Collaborator

@meenchen, could you provide an update to our main README.md and perhaps some pointers regarding the checkpoints on the HF hub?

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8
https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/

@meenchen meenchen disabled auto-merge March 11, 2026 17:11
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
@meenchen meenchen requested a review from a team as a code owner March 11, 2026 17:29
@meenchen
Contributor Author

@meenchen, could you provide an update to our main README.md and perhaps some pointers regarding the checkpoints on the HF hub?

https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8 https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/

Updated the news.

@meenchen meenchen enabled auto-merge (squash) March 11, 2026 17:29
@meenchen meenchen merged commit 34a9fc7 into main Mar 11, 2026
38 checks passed
@meenchen meenchen deleted the weimingc/nemotron_super_matrix branch March 11, 2026 18:12