Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your CodeRabbit settings.
📝 Walkthrough
Adds Nemotron-3 (NemotronHForCausalLM) quantization support to auto_quantize grouping and scoring rules, and updates example documentation to reference Nemotron/Qwen models and new quantization formats and bit settings.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 4 passed
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
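For orientation, the `auto_quantize` feature the walkthrough refers to is exposed through TensorRT Model Optimizer's `modelopt.torch.quantization` module. The sketch below is from memory and not from this PR; every kwarg and config constant is an assumption that may differ across ModelOpt versions, so check the official docs before relying on it.

```python
# Hypothetical sketch of a programmatic AutoQuantize call (not from this PR).
# All argument names and config constants below are assumptions from memory;
# verify against the installed modelopt version.
import modelopt.torch.quantization as mtq

def auto_quantize_sketch(model, calib_loader):
    """Search per-layer formats under an effective-bits budget (hypothetical)."""
    model, search_state = mtq.auto_quantize(
        model,                                  # e.g. an HF NemotronHForCausalLM
        constraints={"effective_bits": 4.75},   # mirrors --auto_quantize_bits 4.75
        quantization_formats=[mtq.NVFP4_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG],
        data_loader=calib_loader,               # user-provided calibration batches
        forward_step=lambda m, batch: m(**batch),
    )
    return model
```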
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)
253-257: ⚠️ Potential issue | 🟠 Major. Inconsistent documentation: the command example doesn't match its description.
The command at line 253 uses `nvfp4_mse,fp8` with `--auto_quantize_bits 4.75`, but the description at lines 256-257 still references the old parameters (`w4a8_awq` and `4.8` bits). This creates confusion for readers trying to understand the example.
📝 Proposed fix for the description
```diff
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse,fp8`) and the more sensitive layers
+are quantized with `fp8` (or kept un-quantized for extremely sensitive layers) such that the effective bits is 4.75 (specified by `--auto_quantize_bits 4.75`).
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/README.md` around lines 253 - 257, The README example is inconsistent: the command uses scripts/huggingface_example.sh with flags --quant nvfp4_mse,fp8 and --auto_quantize_bits 4.75 but the description still mentions w4a8_awq and 4.8 bits; update the paragraph describing AutoQuantize to match the actual command by stating that less accuracy-sensitive layers are quantized with nvfp4_mse,fp8 (as passed to --quant) while more sensitive layers are left un-quantized, and change the effective bits reference from 4.8 to 4.75 (matching --auto_quantize_bits 4.75).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 253-257: The README example is inconsistent: the command uses
scripts/huggingface_example.sh with flags --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 but the description still mentions w4a8_awq and 4.8
bits; update the paragraph describing AutoQuantize to match the actual command
by stating that less accuracy-sensitive layers are quantized with nvfp4_mse,fp8
(as passed to --quant) while more sensitive layers are left un-quantized, and
change the effective bits reference from 4.8 to 4.75 (matching
--auto_quantize_bits 4.75).
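For reference, here is the README command these comments discuss, assembled verbatim from the snippets quoted in the review threads (the `HF_PATH` placeholder is the README's own convention):

```bash
# AutoQuantize example from examples/llm_ptq/README.md, as quoted in the review.
# --quant lists the formats to search; --auto_quantize_bits sets the effective-bit budget.
export HF_PATH=<the downloaded Qwen or Nemotron checkpoint from the Hugging Face hub, or simply the model card>
scripts/huggingface_example.sh --model $HF_PATH \
    --quant nvfp4_mse,fp8 \
    --auto_quantize_bits 4.75 \
    --calib_batch_size 4
```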
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2050d605-5f02-4024-be2d-1b5775d29c97
📥 Commits
Reviewing files that changed from the base of the PR and between 26cad67 and baa8061f7f4eed0c5ce686c93f31bec1ec5c89b2.
📒 Files selected for processing (2)
CHANGELOG.rst
examples/llm_ptq/README.md
Force-pushed from b4d5c51 to 0ff3a64.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)
244-257: ⚠️ Potential issue | 🟡 Minor. Sync the prose with the new AutoQuantize command.
Line 249 still references a LLaMA checkpoint, and lines 256-257 still explain the old `w4a8_awq`/`4.8` example. That now contradicts the updated `nvfp4_mse,fp8` command with `--auto_quantize_bits 4.75`, so the section is easy to misread.
Suggested doc fix
```diff
-  export HF_PATH=<the downloaded LLaMA checkpoint from the Hugging Face hub, or simply the model card>
+  export HF_PATH=<the downloaded Qwen or Nemotron checkpoint from the Hugging Face hub, or simply the model card>
   # --auto_quantize_bits specifies the constraint for `AutoQuantize`
   # --quant specifies the formats to be searched for `AutoQuantize`
   # NOTE: auto_quantize_bits cannot be lower than the number of bits for the smallest quantization format in --quant
   scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75 --calib_batch_size 4
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` by searching across `nvfp4_mse` and `fp8`, targeting an effective quantized bit width of 4.75. Less sensitive layers can stay in the lower-bit format, while more sensitive layers can be assigned `fp8` to preserve accuracy.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/README.md` around lines 244 - 257, Update the README prose to match the new AutoQuantize invocation: replace the "LLaMA checkpoint" mention with a generic Hugging Face model reference for HF_PATH (or "model card"), and rewrite the explanatory paragraph after the script block to describe the current flags used in scripts/huggingface_example.sh (i.e., --quant nvfp4_mse,fp8 and --auto_quantize_bits 4.75 --calib_batch_size 4), removing the old w4a8_awq/4.8 example and instead explain that AutoQuantize will search nvfp4_mse and fp8 formats and target an average effective bit budget of 4.75 while keeping sensitive layers at higher precision as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 244-257: Update the README prose to match the new AutoQuantize
invocation: replace the "LLaMA checkpoint" mention with a generic Hugging Face
model reference for HF_PATH (or "model card"), and rewrite the explanatory
paragraph after the script block to describe the current flags used in
scripts/huggingface_example.sh (i.e., --quant nvfp4_mse,fp8 and
--auto_quantize_bits 4.75 --calib_batch_size 4), removing the old w4a8_awq/4.8
example and instead explain that AutoQuantize will search nvfp4_mse and fp8
formats and target an average effective bit budget of 4.75 while keeping
sensitive layers at higher precision as needed.
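To make the "effective bits" budget concrete: it can be read as a parameter-weighted average of per-layer bit widths. The sketch below is my own illustration with invented layer sizes, not numbers from any actual model:

```python
# Toy illustration (made-up layer sizes): effective bits as a
# parameter-weighted average of per-layer bit widths.
layers = {
    "low_sensitivity_linears": (13e9, 4),   # ~4-bit format such as nvfp4
    "high_sensitivity_linears": (3e9, 8),   # kept in fp8
}

total_params = sum(n for n, _ in layers.values())
effective_bits = sum(n * bits for n, bits in layers.values()) / total_params
print(f"effective bits: {effective_bits:.2f}")  # -> 4.75, matching the budget
```

Under this reading, AutoQuantize searches for a per-layer format assignment whose weighted average stays within the requested budget.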
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: b7426913-10cc-47b9-b7ff-ff6d54c0a7cc
📥 Commits
Reviewing files that changed from the base of the PR and between 9857d36f743a4806b96fe4b69b2f711e105e44ad and b4d5c518d9166ba2888c653edc6ec328cd917591.
📒 Files selected for processing (1)
examples/llm_ptq/README.md
@meenchen the commits seem to have gotten messed up; the diff is showing 8k lines changed.
Force-pushed from 0ff3a64 to baa8061.
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Force-pushed from bceae22 to ab2fd34.
Signed-off-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com>
Fixed now.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/README.md (1)
253-257: ⚠️ Potential issue | 🟠 Major. Update the explanation to match the command example.
The command on line 253 uses `nvfp4_mse,fp8` with `4.75` bits, but the explanation on lines 256-257 still references the old values (`w4a8_awq` and `4.8` bits). This inconsistency will confuse users.
📝 Proposed fix to align the explanation
```diff
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `w4a8_awq` (specified by `--quant w4a8_awq`) and the more sensitive layers
-are kept un-quantized such that the effective bits is 4.8 (specified by `--auto_quantize_bits 4.8`).
+The above example performs `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse,fp8`) and the more sensitive layers
+are quantized with `fp8` or kept un-quantized such that the effective bits is 4.75 (specified by `--auto_quantize_bits 4.75`).
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/README.md` around lines 253 - 257, The README explanation is inconsistent with the example command; update the explanatory text that currently references `w4a8_awq` and `4.8` to match the command `scripts/huggingface_example.sh --model $HF_PATH --quant nvfp4_mse,fp8 --auto_quantize_bits 4.75`, describing that less sensitive layers are quantized with `nvfp4_mse,fp8` (as given by `--quant nvfp4_mse,fp8`) and the effective bits are `4.75` (as given by `--auto_quantize_bits 4.75`) so the prose and the flags `--quant` and `--auto_quantize_bits` in the README align.
🧹 Nitpick comments (1)
CHANGELOG.rst (1)
26-26: Simplify redundant phrasing. The sentence uses "support" three times; consider rephrasing for clarity:
```diff
-- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and support for NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules.
+- Add support for Nemotron-3 (NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping and scoring rules.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@CHANGELOG.rst` at line 26, The sentence in the changelog is redundant; rewrite it to remove repeated "support" while keeping meaning — for example, change the line that mentions "Nemotron-3 (NemotronHForCausalLM) model quantization and support for NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules" to a concise form that mentions Nemotron-3 quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping/scoring (e.g., "Add Nemotron-3 (NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in ``auto_quantize`` grouping and scoring rules").
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@examples/llm_ptq/README.md`:
- Around line 253-257: The README explanation is inconsistent with the example
command; update the explanatory text that currently references `w4a8_awq` and
`4.8` to match the command `scripts/huggingface_example.sh --model $HF_PATH
--quant nvfp4_mse,fp8 --auto_quantize_bits 4.75`, describing that less sensitive
layers are quantized with `nvfp4_mse,fp8` (as given by `--quant nvfp4_mse,fp8`)
and the effective bits are `4.75` (as given by `--auto_quantize_bits 4.75`) so
the prose and the flags `--quant` and `--auto_quantize_bits` in the README
align.
---
Nitpick comments:
In `@CHANGELOG.rst`:
- Line 26: The sentence in the changelog is redundant; rewrite it to remove
repeated "support" while keeping meaning — for example, change the line that
mentions "Nemotron-3 (NemotronHForCausalLM) model quantization and support for
NemotronH MoE expert support in ``auto_quantize`` grouping and scoring rules" to
a concise form that mentions Nemotron-3 quantization and NemotronH MoE expert
handling in ``auto_quantize`` grouping/scoring (e.g., "Add Nemotron-3
(NemotronHForCausalLM) model quantization and NemotronH MoE expert handling in
``auto_quantize`` grouping and scoring rules").
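As a rough illustration of what the "grouping rules" mentioned for NemotronH MoE experts can mean (a hypothetical sketch, not the ModelOpt or NemotronH source; the `.experts.<idx>.` naming is an assumed convention modeled on common HF MoE layouts): per-expert linears are collapsed into one group so all experts of a layer receive the same quantization decision.

```python
import re
from collections import defaultdict

# Hypothetical grouping rule: collapse per-expert linears into one group key
# so all experts of a layer/projection share one auto_quantize decision.
EXPERT_RE = re.compile(r"^(?P<prefix>.*)\.experts\.\d+\.(?P<suffix>.*)$")

def group_quant_targets(module_names):
    groups = defaultdict(list)
    for name in module_names:
        m = EXPERT_RE.match(name)
        key = f"{m['prefix']}.experts.*.{m['suffix']}" if m else name
        groups[key].append(name)
    return dict(groups)

names = [
    "layers.0.mlp.experts.0.up_proj",
    "layers.0.mlp.experts.1.up_proj",
    "layers.0.self_attn.q_proj",
]
for key, members in group_quant_targets(names).items():
    print(key, "->", members)
# layers.0.mlp.experts.*.up_proj -> [both expert up_proj modules]
# layers.0.self_attn.q_proj -> ['layers.0.self_attn.q_proj']
```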
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d4ed63b-c56a-4c2d-890b-205fbdfbe982
📥 Commits
Reviewing files that changed from the base of the PR and between b4d5c518d9166ba2888c653edc6ec328cd917591 and f168065.
📒 Files selected for processing (2)
CHANGELOG.rst
examples/llm_ptq/README.md
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files:
```
@@           Coverage Diff           @@
##             main    #1023   +/-  ##
=======================================
  Coverage   70.09%   70.09%
=======================================
  Files         221      221
  Lines       25459    25459
=======================================
  Hits        17845    17845
  Misses       7614     7614
```
☔ View full report in Codecov by Sentry.
🧹 Nitpick comments (1)
examples/llm_ptq/README.md (1)
256-256: Consider improving readability with hyphenation or restructuring. The phrase "less quantization accuracy sensitive layers" would be clearer with hyphens as a compound modifier ("less quantization-accuracy-sensitive layers") or restructured as "layers that are less sensitive to quantization accuracy".
📝 Proposed improvement
```diff
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+The above example perform `AutoQuantize` where the less quantization-accuracy-sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
```
Or alternatively:
```diff
-The above example perform `AutoQuantize` where the less quantization accuracy sensitive layers are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
+The above example perform `AutoQuantize` where layers that are less sensitive to quantization accuracy are quantized with `nvfp4_mse` (specified by `--quant nvfp4_mse`) and the more sensitive layers
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/README.md` at line 256, The sentence in the README example describing AutoQuantize uses the awkward phrase "less quantization accuracy sensitive layers"; update the wording around the AutoQuantize example (mentioning --quant nvfp4_mse and AutoQuantize) to either hyphenate as "less quantization-accuracy-sensitive layers" or rephrase to "layers that are less sensitive to quantization accuracy" and adjust the following clause similarly for consistency.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/llm_ptq/README.md`:
- Line 256: The sentence in the README example describing AutoQuantize uses the
awkward phrase "less quantization accuracy sensitive layers"; update the wording
around the AutoQuantize example (mentioning --quant nvfp4_mse and AutoQuantize)
to either hyphenate as "less quantization-accuracy-sensitive layers" or rephrase
to "layers that are less sensitive to quantization accuracy" and adjust the
following clause similarly for consistency.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 06d05b9a-645e-4899-bc52-31df2fa8efd3
📒 Files selected for processing (1)
examples/llm_ptq/README.md
@meenchen, could you provide an update to our main https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8?
Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
Updated the news.
What does this PR do?
Type of change: documentation
Add support matrix for Nemotron-3
Usage
# Add a code snippet demonstrating how to use this
Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit
New Features
- Nemotron-3 (NemotronHForCausalLM) quantization support in auto_quantize grouping and scoring rules.
Documentation
- Example documentation updated to reference Nemotron/Qwen models and new quantization formats and bit settings.