Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe#1327
Conversation
…cipe
Extends ModelOpt PTQ so the input token embedding and output LM head can
participate in NVFP4 quantization, and wires that up for Nemotron-H where
those two 131072x3136 tables are ~21% of parameters and leaving them in
bf16 wastes most of the compression.
Changes
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new):
register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that
inherits QuantLinearConvBase but disables the input_quantizer by default
(embedding inputs are integer indices, not activations); a hypothetical registration sketch follows the Changes list below.
- modelopt/torch/quantization/config.py: add NVFP4_DEFAULT_WEIGHT_ONLY_CFG
(W4A16) via the existing _nvfp4_selective_quant_cfg(..., weight_only=True)
helper; export via the `choices` set.
- modelopt/torch/export/unified_export_hf.py: _process_quantized_modules now
also walks quantized Embedding modules (previously is_quantlinear-only),
so the NVFP4 packing + scale registration path runs for them on export.
- examples/llm_ptq/hf_ptq.py: add `nvfp4_wo` qformat. For model_type ==
"nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer
and target the backbone embedding (*embeddings* / *embed_tokens*),
overriding the default *lm_head* disable in _default_disabled_quantizer_cfg.
- examples/llm_ptq/example_utils.py: environment workarounds so the example
runs on transformers 5.5.x's partial Nemotron-H port (idempotent, no-op
on fixed transformers):
* NemotronHConfig._pattern_to_list: add `-` -> `mlp`
* ALLOWED_LAYER_TYPES: add `"mlp"`
* NemotronHConfig.validate_layers_block_type: accept `"mlp"` (also
update __class_validators__ since @strict_dataclass snapshots it)
* MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the
layer_idx kwarg passed by NemotronHBlock
* NemotronHBlock.__init__: alias block_type=="mlp" -> "moe" so the
inline block_type_to_mask lookup in NemotronHModel.forward resolves
to None (dispatch is unaffected — the block's forward routes both
through the same `else` branch)
* generation_config: set do_sample=True when sampling hyperparams are
set, so export's save_pretrained passes 5.x strict validation
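As a reference for the quant_embedding.py item above, here is a minimal sketch of how such a registration could look. The registry decorator signature, the base-class import path, and the _setup/disable hooks are assumptions modeled on how ModelOpt registers QuantLinear; this is not the PR's verbatim code.
# Hypothetical sketch only: import paths and registry/base-class APIs are assumed
# from the existing QuantLinear pattern, not copied from this PR.
import torch.nn as nn

from modelopt.torch.quantization.nn import QuantModuleRegistry
from modelopt.torch.quantization.nn.modules.quant_module import QuantLinearConvBase  # path is an assumption


@QuantModuleRegistry.register({nn.Embedding: "nn.Embedding"})
class _QuantEmbedding(QuantLinearConvBase):
    """Weight-only quantized embedding: inputs are token ids, so only the weight is quantized."""

    def _setup(self):
        super()._setup()
        # Embedding inputs are integer indices, not activations, so the
        # input quantizer is disabled by default.
        self.input_quantizer.disable()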
Validated end-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--qformat nvfp4_wo --kv_cache_qformat none \
--trust_remote_code --dataset cnn_dailymail \
--calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
--export_path /tmp/nemotron_3_nano_4b_nvfp4_wo
Produces a 2.1 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings and lm_head both exported as packed NVFP4 uint8 +
FP8 per-block scales + FP32 global scales.
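For a quick sanity check of the exported checkpoint, something like the snippet below dumps the dtypes and shapes of the embedding/lm_head tensors. The shard filename and the exact tensor names are assumptions about the unified export layout, not guaranteed by this PR.
# Sketch: inspect the exported unified HF checkpoint. The shard filename and tensor
# names are assumptions about the export layout; adjust to what the export wrote.
from safetensors import safe_open

ckpt = "/tmp/nemotron_3_nano_4b_nvfp4_wo/model.safetensors"  # hypothetical single-shard file
with safe_open(ckpt, framework="pt") as f:
    for name in f.keys():
        if "lm_head" in name or "embeddings" in name:
            t = f.get_tensor(name)
            # Packed NVFP4 weights should show up as uint8 with the last dim halved;
            # per-block scales as float8_e4m3fn; global scales as float32 scalars.
            print(f"{name}: dtype={t.dtype}, shape={tuple(t.shape)}")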
Follow-ups (separate PRs):
- compressed-tensors conversion script for vLLM consumption (rename
weight -> weight_packed, weight_scale_2 -> weight_global_scale, rewrite
config.json quantization_config to format=nvfp4-pack-quantized).
- offline vLLM inference script for the converted checkpoint.
- Nemotron-H config.json post-export cleanup (transformers 5.x strips
hybrid_override_pattern in favor of the derived layers_block_type,
which breaks reload via the checkpoint's remote configuration_nemotron_h.py
because layers_block_type there is a read-only property).
- optional --vllm-compat hf_ptq flag that also excludes Mamba in_proj
(output dim 17504 not divisible by 64, violating vLLM's Marlin repack
alignment) so the export is consumable by vLLM out of the box.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds NVFP4 W4A16 weight-only quantization end-to-end (config, selection, exporter packing, tests), registers nn.Embedding quantization and exports embedding weights, adds a generation_config normalization export context, and exposes --exclude_modules in the PTQ CLI and example script.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Script
participant HF_PTQ
participant Quant_Config
participant Exporter
participant Filesystem
User->>Script: invoke (QFORMAT=nvfp4_w4a16, EXCLUDE_MODULES)
Script->>HF_PTQ: run hf_ptq with args
HF_PTQ->>Quant_Config: build/modify quant config (NVFP4_W4A16_CFG, exclusions)
HF_PTQ->>HF_PTQ: apply normalized_generation_config_for_export()
HF_PTQ->>Exporter: export_quantized()
Exporter->>Filesystem: write quantized safetensors (include embeddings, NVFP4 packing)
Exporter->>User: emit export path & runtime warnings
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 5 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)
102-123: ⚠️ Potential issue | 🟠 Major
nvfp4_wo is still rejected by auto_quantize(). The new choice is exposed here, but the hard-coded qformat allowlist in auto_quantize() (lines 325-344) was not updated. --auto_quantize_bits --qformat nvfp4_wo,... now fails the assertion even though the format is advertised.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 102 - 123, The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but not included in the hard-coded qformat allowlist inside auto_quantize(), causing the assertion failure; update the allowlist in the auto_quantize() function to include "nvfp4_wo" (or extend the allowlist to derive keys from QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted; search for the auto_quantize function and add "nvfp4_wo" to the qformat/allowlist there (or replace the static list with QUANT_CFG_CHOICES.keys()) to keep them in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 853-867: The current code mutates the live model.generation_config
(gen_cfg) which makes the same model instance used by get_model()
non-deterministic; instead, create a copy of the generation_config (e.g., via
copy.deepcopy or by constructing a new GenerationConfig from the dict) and
modify the copy’s do_sample flag, leaving model.generation_config unchanged;
update the export/normalization logic around gen_cfg to use this gen_cfg_copy
(or a temporary variable) so previews/full_model.generate() remain deterministic
and only the exported metadata contains the normalized setting.
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only appends a "*lm_head*weight_quantizer" override which causes
activation-quantized recipes (e.g., fp8/nvfp4 applied by mono_quantize) to
become mixed-format at lm_head; update this function so it either (A) checks the
applied recipe/weight_quantizer_cfg and only appends the lm_head weight override
for weight-only formats, or (B) when adding "*lm_head*weight_quantizer" also
append a corresponding "*lm_head*input_quantizer" entry that mirrors the base
input-quantizer entry (use copy.deepcopy of the existing input_quantizer config)
so lm_head keeps the same activation format as the rest of the model; reference
_enable_lm_head_and_embedding_quantization, the "quant_cfg" list entries, and
mono_quantize when implementing the conditional or mirrored input_quantizer
addition.
---
Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 102-123: The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but
not included in the hard-coded qformat allowlist inside auto_quantize(), causing
the assertion failure; update the allowlist in the auto_quantize() function to
include "nvfp4_wo" (or extend the allowlist to derive keys from
QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted;
search for the auto_quantize function and add "nvfp4_wo" to the
qformat/allowlist there (or replace the static list with
QUANT_CFG_CHOICES.keys()) to keep them in sync.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a191d33d-6d3e-4cf2-abd0-8b9541d5908e
📒 Files selected for processing (6)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/nn/__init__.py
- modelopt/torch/quantization/nn/modules/quant_embedding.py
# Some model cards ship a generation_config.json that sets sampling hyperparameters
# (top_p, temperature) without ``do_sample=True`` (e.g. NVIDIA-Nemotron-3-Nano-4B-BF16).
# transformers 5.x strictly validates this on save_pretrained, so the export step
# fails with "GenerationConfig is invalid". Normalize by enabling do_sample whenever
# a sampling hyperparameter is set — this is only metadata, not behavior during
# calibration or export.
gen_cfg = getattr(model, "generation_config", None)
if gen_cfg is not None and not getattr(gen_cfg, "do_sample", False):
    has_sampling_hyperparam = (
        getattr(gen_cfg, "top_p", None) not in (None, 1.0)
        or getattr(gen_cfg, "top_k", None) not in (None, 0, 50)
        or getattr(gen_cfg, "temperature", None) not in (None, 1.0)
    )
    if has_sampling_hyperparam:
        gen_cfg.do_sample = True
🧩 Analysis chain
🏁 Scripts executed against NVIDIA/Model-Optimizer (outputs elided):
cd examples/llm_ptq && head -n 1 example_utils.py && wc -l example_utils.py
cd examples/llm_ptq && sed -n '830,880p' example_utils.py
cd examples/llm_ptq && grep -n "def get_model" example_utils.py | head -5
cd examples/llm_ptq && rg "get_model\(" hf_ptq.py -A 5 -B 2
cd examples/llm_ptq && rg "\.generate\(" hf_ptq.py -B 3 -A 3
cd examples/llm_ptq && sed -n '1,50p' hf_ptq.py | head -30
cd examples/llm_ptq && grep -n "full_model" hf_ptq.py | head -20
cd examples/llm_ptq && sed -n '300,350p' hf_ptq.py
cd examples/llm_ptq && sed -n '334,345p' hf_ptq.py
cd examples/llm_ptq && sed -n '350,365p' hf_ptq.py
cd examples/llm_ptq && grep -n "generated_ids_before_ptq" hf_ptq.py
cd examples/llm_ptq && sed -n '257,270p' hf_ptq.py
cd examples/llm_ptq && sed -n '580,600p' hf_ptq.py
cd examples/llm_ptq && sed -n '880,925p' hf_ptq.py
cd examples/llm_ptq && sed -n '1100,1125p' hf_ptq.py
cd examples/llm_ptq && sed -n '970,1020p' hf_ptq.py
cd examples/llm_ptq && sed -n '1180,1220p' hf_ptq.py
Don't mutate the live generation_config in get_model().
The mutation persists on the returned model object, and both the before-PTQ and after-PTQ preview calls (full_model.generate() at lines 922 and 980 in hf_ptq.py) use that same model instance. For checkpoints with sampling hyperparameters, this makes the previews non-deterministic instead of deterministic, undermining PTQ smoke test comparisons. Normalize a copy during export instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 853 - 867, The current code
mutates the live model.generation_config (gen_cfg) which makes the same model
instance used by get_model() non-deterministic; instead, create a copy of the
generation_config (e.g., via copy.deepcopy or by constructing a new
GenerationConfig from the dict) and modify the copy’s do_sample flag, leaving
model.generation_config unchanged; update the export/normalization logic around
gen_cfg to use this gen_cfg_copy (or a temporary variable) so
previews/full_model.generate() remain deterministic and only the exported
metadata contains the normalized setting.
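A minimal sketch of the suggested fix: swap a normalized deep copy of the generation_config in only for the duration of export, leaving the live model.generation_config untouched. The helper name matches the normalized_generation_config_for_export() referenced in the walkthrough, but the body below is a sketch, not the PR's code.
# Sketch of the suggested fix: swap in a normalized copy of generation_config around
# export so preview generate() calls stay deterministic. Body is illustrative only.
import copy
from contextlib import contextmanager


@contextmanager
def normalized_generation_config_for_export(model):
    original = getattr(model, "generation_config", None)
    if original is None:
        yield
        return
    normalized = copy.deepcopy(original)
    if not getattr(normalized, "do_sample", False):
        has_sampling_hyperparam = (
            getattr(normalized, "top_p", None) not in (None, 1.0)
            or getattr(normalized, "top_k", None) not in (None, 0, 50)
            or getattr(normalized, "temperature", None) not in (None, 1.0)
        )
        if has_sampling_hyperparam:
            normalized.do_sample = True  # metadata only; calibration and previews unaffected
    model.generation_config = normalized
    try:
        yield
    finally:
        model.generation_config = original  # restore the untouched original
Export would then run inside the context, e.g. `with normalized_generation_config_for_export(model): export_hf_checkpoint(...)`, so only the saved metadata carries the normalized flag.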
…mat ID
Integrates the scaffolding from PR #1313 (NVFP4 W4A16 generic support) into this branch so the two PRs don't diverge on naming or export-path coverage. The embedding + Nemotron-H enablement in the previous commit is unchanged; this commit just adopts #1313's conventions for the pieces that overlap.
Changes
- modelopt/torch/quantization/config.py: rename the W4A16 recipe constant NVFP4_DEFAULT_WEIGHT_ONLY_CFG -> NVFP4_W4A16_CFG to match #1313.
- modelopt/torch/export/model_config.py: add QUANTIZATION_NVFP4_W4A16 as a distinct format ID instead of relying on NVFP4 branches tolerating a disabled input_quantizer.
- modelopt/torch/export/quant_utils.py: thread NVFP4_W4A16 through get_weight_scaling_factor, get_weight_scaling_factor_2, to_quantized_weight, and the nvfp4_w4a16 branch of process_layer_quant_config. Add explicit W4A16 detection in _get_quantization_from_layer when input_quantizer is absent/disabled.
- modelopt/torch/export/unified_export_hf.py: add NVFP4_W4A16 to the weight_scale_2 registration and NVFP4 transpose lists.
- modelopt/torch/export/convert_hf_config.py: add NVFP4_W4A16 mapping in _quant_algo_to_group_config and convert_hf_quant_config_format so the llm-compressor conversion emits a weight-only config group.
- examples/llm_ptq/hf_ptq.py: rename qformat nvfp4_wo -> nvfp4_w4a16; add --exclude_modules CLI (composes with the Nemotron-H helpers added in the previous commit); emit a post-export vLLM deployment warning.
- examples/llm_ptq/scripts/huggingface_example.sh: add nvfp4_w4a16 to the qformat allowlist, EXCLUDE_MODULES env pass-through, and a W4A16 export notice.
- CHANGELOG.rst: document W4A16 (covers the overlap with #1313) and the Embedding / Nemotron-H enablement unique to this PR.
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py: add an nvfp4_w4a16 parametrize entry for the tiny_llama fixture.
pre-commit (ruff, ruff-format, mypy, bandit, insert-license, rst-lint) passes on all touched files.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
examples/llm_ptq/hf_ptq.py (1)
597-637: ⚠️ Potential issue | 🟠 Major
Keep lm_head quantization aligned with the base Nemotron-H recipe. This helper only re-enables *lm_head*weight_quantizer. When Nemotron-H runs with an activation-aware recipe such as fp8 or nvfp4, lm_head stops matching the rest of the model; for NVFP4, modelopt/torch/export/quant_utils.py now even reclassifies it as nvfp4_w4a16 because the input quantizer stays disabled. Either gate this helper to weight-only recipes or append a mirrored *lm_head*input_quantizer rule copied from the base config.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 597 - 637, The helper _enable_lm_head_and_embedding_quantization currently only re-enables the lm_head weight quantizer which desynchronizes lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to either (A) only run when the active recipe is weight-only (check quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored "*lm_head*input_quantizer" entry copied from the base/input quantizer config so lm_head keeps the same input quantization as the rest of the model; modify _enable_lm_head_and_embedding_quantization to perform one of these two fixes and ensure the new entry uses copy.deepcopy like the existing weight entries.
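A minimal sketch of option (B) from this comment, mirroring the base input-quantizer entry for lm_head. The dictionary-style quant_cfg and the helper/parameter names are assumptions for illustration; only the deepcopy/mirroring idea comes from the review comment.
# Sketch (hypothetical names): re-enable lm_head weight quantization and mirror the
# base input quantizer so lm_head keeps the same activation format as other layers.
import copy


def extend_lm_head(quant_cfg: dict, weight_quantizer_cfg: dict, input_quantizer_cfg: dict | None):
    cfg = quant_cfg["quant_cfg"]
    # Appending after the default "*lm_head*" disable re-enables it,
    # since the last matching wildcard entry wins.
    cfg["*lm_head*weight_quantizer"] = copy.deepcopy(weight_quantizer_cfg)
    if input_quantizer_cfg is not None:  # activation-aware recipe (fp8, nvfp4, ...)
        cfg["*lm_head*input_quantizer"] = copy.deepcopy(input_quantizer_cfg)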
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 686-697: The Nemotron-H opt-in currently adds enable/config
entries after quantize_main() has already appended user --exclude_modules rules
(so it can override user exclusions); update the flow so user exclusions are
respected by either moving the
_enable_lm_head_and_embedding_quantization(quant_cfg, weight_quantizer_cfg) call
to run before quantize_main()/before mono_quantize() applies exclude updates, or
(preferably) change _enable_lm_head_and_embedding_quantization to check
quant_cfg.exclude_modules (and any existing disable rules) and skip adding
enable/config entries for "lm_head" or "embeddings" if the user explicitly
excluded them; make this change around quantize_main(), mono_quantize(), and
_enable_lm_head_and_embedding_quantization so the user's --exclude_modules is
never silently undone.
In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 130-132: The wrapper currently expands shell globs because
EXCLUDE_MODULES is injected unquoted into PTQ_ARGS before calling hf_ptq.py; fix
this by preserving the literal pattern when forwarding --exclude_modules: stop
building a single unquoted string and use a bash array for PTQ_ARGS (e.g.,
append the two separate elements "--exclude_modules" and "$EXCLUDE_MODULES") or
otherwise ensure the variable is quoted when added so hf_ptq.py receives the
exact pattern (references: EXCLUDE_MODULES, PTQ_ARGS, and the --exclude_modules
argument passed to hf_ptq.py).
In `@modelopt/torch/export/convert_hf_config.py`:
- Around line 191-198: The NVFP4_W4A16 branch sets config_group_details.targets
to only ["Linear"], which omits embeddings even though weight-only quantization
applies to Embedding layers; update the targets list in the NVFP4_W4A16 branch
(where quant_algo_value == "NVFP4_W4A16" and config_group_details is built) to
include "Embedding" (e.g., ["Linear", "Embedding"]) before assigning
new_config["config_groups"] so compressed-tensors exports match the actual
NVFP4_W4A16 quantization coverage.
---
Duplicate comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only re-enables the lm_head weight quantizer which desynchronizes
lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to
either (A) only run when the active recipe is weight-only (check
quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored
"*lm_head*input_quantizer" entry copied from the base/input quantizer config so
lm_head keeps the same input quantization as the rest of the model; modify
_enable_lm_head_and_embedding_quantization to perform one of these two fixes and
ensure the new entry uses copy.deepcopy like the existing weight entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7568812b-30c3-4c61-bbc4-d04ef8fc9364
📒 Files selected for processing (9)
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/convert_hf_config.py
- modelopt/torch/export/model_config.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
- modelopt/torch/export/model_config.py
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/unified_export_hf.py
- example_utils: swap in a normalized generation_config via context manager during export instead of mutating the live one in get_model() — preview generate() calls stay deterministic
- hf_ptq: mirror *lm_head*input_quantizer for activation-aware recipes so lm_head doesn't silently downgrade to W4A16 under NVFP4/FP8
- hf_ptq: respect --exclude_modules in the Nemotron-H lm_head/embedding override so user exclusions aren't silently undone
- hf_ptq: add nvfp4_w4a16 to the auto_quantize qformat allowlist for consistency with QUANT_CFG_CHOICES
- huggingface_example.sh: pass --exclude_modules via a bash array (set -f) so wildcard patterns like *embed_tokens* reach argparse verbatim instead of being glob-expanded against the filesystem
- convert_hf_config: include Embedding in the NVFP4_W4A16 target set so compressed-tensors consumers dispatch on quantized embedding weights
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
🧹 Nitpick comments (1)
examples/llm_ptq/example_utils.py (1)
873-876: Consider documenting the top_k=50 check. The check getattr(original, "top_k", None) not in (None, 0, 50) treats top_k=50 as a "default/unset" value. While 50 is indeed the transformers default, this implicit knowledge could benefit from a brief inline comment for maintainability.
💡 Suggested clarification
  has_sampling_hyperparam = (
      getattr(original, "top_p", None) not in (None, 1.0)
-     or getattr(original, "top_k", None) not in (None, 0, 50)
+     or getattr(original, "top_k", None) not in (None, 0, 50)  # 50 is transformers default
      or getattr(original, "temperature", None) not in (None, 1.0)
  )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/example_utils.py` around lines 873 - 876, The condition using getattr(original, "top_k", None) not in (None, 0, 50) implicitly treats top_k==50 as a default/unset value; add a brief inline comment next to that expression (or expand the surrounding docstring) stating that 50 is the HuggingFace/transformers default so it should be treated as unset, e.g., “# 50 is transformers' default top_k, treat as unset”; ensure the comment references getattr(original, "top_k", None) so future readers understand why 50 is excluded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 873-876: The condition using getattr(original, "top_k", None) not
in (None, 0, 50) implicitly treats top_k==50 as a default/unset value; add a
brief inline comment next to that expression (or expand the surrounding
docstring) stating that 50 is the HuggingFace/transformers default so it should
be treated as unset, e.g., “# 50 is transformers' default top_k, treat as
unset”; ensure the comment references getattr(original, "top_k", None) so future
readers understand why 50 is excluded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0eeb4272-c356-455a-9a92-0ae3a6fbd489
📒 Files selected for processing (4)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/convert_hf_config.py
    mts.export(full_model)


def _enable_lm_head_and_embedding_quantization(
Can we define this in a modelopt_recipe if everything in modelopt_recipes/models can be captured with our yaml recipe system?
    # For Nemotron-H (Mamba-2 + MLP + Attention hybrid, e.g. NVIDIA-Nemotron-3-Nano-4B),
    # extend quantization coverage to the lm_head and the input token embedding. On this
    # architecture those two 131072x3136 tables account for ~21% of parameters, so leaving
    # them at bf16 wastes most of the NVFP4 memory benefit.
    if model_type == "nemotron_h":
        weight_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "weight_quantizer")
        if weight_quantizer_cfg is not None:
            # ``input_quantizer_cfg`` is present only for activation-aware recipes (fp8, nvfp4,
            # ...). For weight-only recipes (nvfp4_w4a16, fp8_pb_wo, ...) this returns None and
            # ``lm_head`` stays weight-only along with the embedding.
            input_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "input_quantizer")
            print(
                "Nemotron-H detected: extending quantization to lm_head and input embedding "
                "(backbone.embeddings)."
            )
            _enable_lm_head_and_embedding_quantization(
                quant_cfg,
                weight_quantizer_cfg,
                input_quantizer_cfg=input_quantizer_cfg,
                user_excluded_modules=args.exclude_modules or None,
            )
        else:
            warnings.warn(
                "Nemotron-H detected but quant_cfg has no wildcard '*weight_quantizer' entry; "
                "skipping lm_head/embedding extension (model-specific or non-standard recipe)."
            )
Same as the previous comment, wondering if our recipe system can replace this ad hoc change to support specific models
    if args.qformat == "nvfp4_w4a16":
        warnings.warn(
            "TensorRT-LLM and SGLang do not support this format. "
            "To serve on vLLM, convert the NVFP4 W4A16 checkpoint to compressed-tensors format."
hi @ajrasane, should we point users to how they can convert? Do we have a helper in ModelOpt we should point them to?
@hychiang-git, are you planning to merge your conversion script to modelopt?
…ML recipe
Move the Nemotron-H-specific quantization extensions out of `hf_ptq.py` and into a declarative recipe at `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`, addressing PR #1327 review feedback.
The recipe captures exactly what the removed `_enable_lm_head_and_embedding_quantization` helper did:
* All Linear weight quantizers ON (NVFP4 W4A16, group_size 16, scale_bits e4m3).
* Standard `_default_disabled_quantizer_cfg` exclusions (BatchNorm, conv1d, etc.).
* `*lm_head*weight_quantizer`, `*embeddings*weight_quantizer`, and `*embed_tokens*weight_quantizer` re-enabled AFTER the default disables so they take precedence (last matching entry wins).
Drop the helpers (`_enable_lm_head_and_embedding_quantization`, `_extract_wildcard_quantizer_cfg`) and the `if model_type == "nemotron_h":` block in `mono_quantize`. Users now opt in explicitly via `--recipe models/Nemotron-H/nvfp4_w4a16` instead of relying on auto-detection.
Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (RTX 6000 Ada, calib_size=16, calib_seq=256): 94 weight quantizers enabled and 21 disabled (the Mamba `*mixer.conv1d*` layers), `lm_head.weight_quantizer` and `model.embeddings.weight_quantizer` carry NVFP4 cfg, exported safetensors is 2.13 GiB (matches prior PR-validation export size), and `hf_quant_config.json` reports `quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d layers]`.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
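For readers who want to see what the recipe expresses, here is a rough Python rendering of the resulting quantizer config. The field names follow ModelOpt's NVFP4 config conventions and are assumptions here; this is not the YAML file's contents.
# Rough rendering of what the Nemotron-H nvfp4_w4a16 recipe expresses; field names
# follow ModelOpt's NVFP4 config conventions and are assumptions, not the YAML itself.
NVFP4_WEIGHT_ONLY = {
    "num_bits": (2, 1),                                                # FP4 (E2M1) weights
    "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},  # group 16, E4M3 scales
    "axis": None,
    "enable": True,
}

NEMOTRON_H_NVFP4_W4A16 = {
    "quant_cfg": {
        "*weight_quantizer": NVFP4_WEIGHT_ONLY,             # all Linear weights
        "*input_quantizer": {"enable": False},              # W4A16: activations stay bf16
        "*mixer.conv1d*": {"enable": False},                # standard default disables
        "*lm_head*weight_quantizer": NVFP4_WEIGHT_ONLY,     # re-enabled after the defaults;
        "*embeddings*weight_quantizer": NVFP4_WEIGHT_ONLY,  # last matching entry wins
        "*embed_tokens*weight_quantizer": NVFP4_WEIGHT_ONLY,
    },
    "algorithm": "max",
}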
`_maybe_patch_transformers_nemotron_h_mixer_types()` papered over three gaps
in transformers 5.5.x's built-in Nemotron-H port:
1. `NemotronHConfig._pattern_to_list` not mapping `-` → mlp
2. `MIXER_TYPES` not registering `mlp`
3. `ALLOWED_LAYER_TYPES` not including `mlp`
modelopt's supported transformers floor is 4.56 (per `pyproject.toml` and
`modelopt/torch/__init__.py`). On 4.56.x the upstream Nemotron-H module
doesn't exist at all (`transformers.models.nemotron_h` is missing) — so the
patch's three branches all hit `ImportError` and silently no-op. With
`--trust_remote_code`, the Nemotron-H checkpoint's bundled
`configuration_nemotron_h.py` / `modeling_nemotron_h.py` carry their own
`MIXER_TYPES`, `_pattern_to_list`, and `validate_layers_block_type`, so the
ad-hoc transformers patches were never load-bearing for that codepath either.
Drop the helper and the `_maybe_patch_transformers_nemotron_h_mixer_types()`
call in `get_model()`. Smoke-tested on `transformers 4.56.2` with
`AutoConfig.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
trust_remote_code=True)` — config and model load cleanly without the patch.
Users running on transformers 5.5.x (modelopt's experimental band) who
attempt to load Nemotron-H without `--trust_remote_code` will hit the
upstream gap directly; the recommended path remains
`--trust_remote_code` + the model's bundled remote code.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
The `--exclude_modules` flag was added in this PR as an escape hatch for overriding the auto-applied lm_head/embedding inclusion on Nemotron-H. Now that meenchen's recipe-system review is addressed and the Nemotron-H extensions live in `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`, this flag has no remaining purpose: users who want different exclusions write a different recipe.
Removes:
* the `--exclude_modules` argparse entry in `hf_ptq.py`
* the `args.exclude_modules` apply-loop in `quantize_main()`
* the `EXCLUDE_MODULES` env-var passthrough + `EXCLUDE_MODULES_ARGS` bash array in `examples/llm_ptq/scripts/huggingface_example.sh`
Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` with `--recipe models/Nemotron-H/nvfp4_w4a16` (transformers 4.56.2, GPU 5, calib_size=16): same coverage as before — 94 weight quantizers enabled, 21 disabled (the Mamba `*mixer.conv1d*` layers); `lm_head.weight_quantizer` and `backbone.embeddings.weight_quantizer` carry NVFP4 W4A16 cfg; exported safetensors 2.1 GiB; `hf_quant_config.json` reports `quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d layers]`.
The recipe still dictates the exclusion set, so behavior is unchanged for the supported codepath.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Related PR
Aligns with #1313 (Support NVFP4 W4A16 quantization) — shares the NVFP4_W4A16_CFG recipe, the nvfp4_w4a16 qformat, and the QUANTIZATION_NVFP4_W4A16 format ID. This PR adds the embedding + lm_head quantization support on top.
Summary
Extends ModelOpt PTQ so the input token embedding and output LM head can participate in NVFP4 quantization, and wires that up for Nemotron-H where those two 131072×3136 tables are ~21% of parameters and leaving them in bf16 wastes most of the compression.
Changes
Core quantization library
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new) — Register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that inherits QuantLinearConvBase but disables the input_quantizer by default (embedding inputs are integer indices, not activations; output_quantizer is already disabled by QuantInputBase._setup).
- modelopt/torch/quantization/nn/__init__.py — Import the new quant_embedding module so registration fires at library import time.
Export
- modelopt/torch/export/unified_export_hf.py — _process_quantized_modules now also walks quantized Embedding modules (previously is_quantlinear-only), so the NVFP4 packing + scale registration path in _export_quantized_weight runs for them on export.
hf_ptq.py example
- For model_type == "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer and target the backbone embedding (*embeddings* / *embed_tokens*), overriding the default *lm_head* disable in _default_disabled_quantizer_cfg. Guarded helpers (_enable_lm_head_and_embedding_quantization, _extract_weight_quantizer_cfg) so the override only fires when a standard *weight_quantizer entry is present.
example_utils.py — environment workarounds
These are idempotent workarounds for transformers 5.5.x's partial Nemotron-H port; they no-op on a fixed transformers (e.g. inside the TRT Docker container's newer wheel):
- NemotronHConfig._pattern_to_list: add - → mlp
- ALLOWED_LAYER_TYPES: add "mlp"
- NemotronHConfig.validate_layers_block_type: accept "mlp" (also update __class_validators__ since huggingface_hub's @strict_dataclass snapshots validators at class-creation time, so overwriting the method attribute alone isn't enough)
- MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the layer_idx kwarg passed by NemotronHBlock
- NemotronHBlock.__init__: alias block_type == "mlp" → "moe" so the inline block_type_to_mask lookup in NemotronHModel.forward resolves to None (dispatch is unaffected — the block's forward routes both through the same else branch that calls self.mixer(hidden_states))
- generation_config: set do_sample=True when sampling hyperparams are set, so export's save_pretrained passes transformers 5.x strict validation
Validation
nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:python examples/llm_ptq/hf_ptq.py \ --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \ --qformat nvfp4_w4a16 --kv_cache_qformat none \ --trust_remote_code --dataset cnn_dailymail \ --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \ --export_path /tmp/nemotron_3_nano_4b_nvfp4_w4a16Produces a 2.2 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings.weightandlm_head.weightboth stored as packed NVFP4 uint8 + FP8 per-block scales + FP32 global scale.hf_quant_config.jsonreportsquant_algo: NVFP4_W4A16,group_size: 16, andexclude_modulescontains only the 21 Mambaconv1dlayers (the default_default_disabled_quantizer_cfgentry for*mixer.conv1d*).pre-commit run --files <staged>passes (ruff, ruff-format, mypy, bandit, insert-license, rst-lint).Follow-ups (separate PRs)
*.weight → *.weight_packed,*.weight_scale_2 → *.weight_global_scale(inverted), and rewritesconfig.jsonquantization_configtoformat: nvfp4-pack-quantized/quant_method: compressed-tensors. Already prototyped out-of-tree; just needs cleanup + tests.vllm.LLMwith chat-template rendering,max_model_lencap,--enforce-eagerdefault for Mamba/SSM). Already prototyped out-of-tree.config.jsonpost-export cleanup: transformers 5.x stripshybrid_override_patternin favor of the derivedlayers_block_typelist, which breaks reload via the checkpoint's remoteconfiguration_nemotron_h.py(itslayers_block_typeis a read-only@property). The export path should restorehybrid_override_patternand setnum_hidden_layersexplicitly formodel_type == "nemotron_h".--vllm-compathf_ptqflag that additionally excludes Mambain_proj(output dim 17504 =intermediate + conv_dim + num_headsisn't divisible by 64, violating vLLM's Marlin repack alignment) and leaveslm_head/model.embeddingsin bf16 (vLLM'sParallelLMHead/VocabParallelEmbeddingdon't consume compressed-tensors scales), so the export is consumable by vLLM out of the box.example_utils.pymonkey-patches can be dropped.Test plan
Test plan
- nn.Embedding registers and is replaced with QuantEmbedding under mtq.quantize(..., NVFP4_W4A16_CFG, forward_loop=None) on a toy Sequential(Embedding, Linear) model; verified forward pass on CUDA.
- End-to-end PTQ + export on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (see Validation above).
- A test under tests/gpu/torch/export/ for nn.Embedding weight packing is a follow-up once the conversion/load path lands.
- hf_ptq.py, and this PR doesn't change that.
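A rough re-creation of the toy smoke test described in the first bullet above (not code from this PR); it assumes NVFP4_W4A16_CFG is exposed at the modelopt.torch.quantization package level like the other recipe constants.
# Rough sketch of the toy smoke test (not the PR's code); assumes NVFP4_W4A16_CFG is
# exported from modelopt.torch.quantization like the other recipe constants.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).cuda()

# Weight-only recipe, so no calibration forward_loop is needed.
model = mtq.quantize(model, mtq.NVFP4_W4A16_CFG, forward_loop=None)
print(type(model[0]).__name__)  # expect the registered QuantEmbedding wrapper, not nn.Embedding

tokens = torch.randint(0, 1000, (2, 8), device="cuda")
print(model(tokens).shape)  # torch.Size([2, 8, 1000])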
Summary by CodeRabbit
New Features
Integration
Tests