Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe#1327

Open
ajrasane wants to merge 6 commits into main from ajrasane/nemotron-3-nano

Conversation


@ajrasane ajrasane commented Apr 22, 2026

Related PR

Aligns with #1313 (Support NVFP4 W4A16 quantization) — shares the NVFP4_W4A16_CFG recipe, the nvfp4_w4a16 qformat, and the QUANTIZATION_NVFP4_W4A16 format ID. This PR adds the embedding + lm_head quantization support on top.

Summary

Extends ModelOpt PTQ so the input token embedding and the output LM head can participate in NVFP4 quantization, and wires that up for Nemotron-H, where those two 131072×3136 tables are ~21% of parameters (2 × 131072 × 3136 ≈ 0.82B of the model's ~4B) and leaving them in bf16 wastes much of the compression win.

Changes

Core quantization library

  • modelopt/torch/quantization/nn/modules/quant_embedding.py (new) — registers nn.Embedding with QuantModuleRegistry. A weight-only wrapper that inherits QuantLinearConvBase but disables the input_quantizer by default, since embedding inputs are integer indices, not activations (the output_quantizer is already disabled by QuantInputBase._setup); see the sketch after this list.
  • modelopt/torch/quantization/nn/__init__.py — Import the new quant_embedding module so registration fires at library import time.
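
For orientation, here is a plain-PyTorch analogue of what the wrapper does. This is an illustrative sketch only: the real module inherits QuantLinearConvBase and is swapped in by QuantModuleRegistry, and the class and attribute names below are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantEmbedding(nn.Embedding):
    # Weight-only fake-quant embedding: integer indices pass through
    # untouched (the disabled input_quantizer); only the lookup table
    # is quantized.
    def __init__(self, *args, weight_quantizer=None, **kwargs):
        super().__init__(*args, **kwargs)
        # Identity until a real quantizer (e.g. NVFP4) is attached.
        self.weight_quantizer = weight_quantizer or (lambda w: w)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return F.embedding(
            input_ids,
            self.weight_quantizer(self.weight),  # fake-quantized table
            self.padding_idx,
            self.max_norm,
            self.norm_type,
            self.scale_grad_by_freq,
            self.sparse,
        )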

Export

  • modelopt/torch/export/unified_export_hf.py — _process_quantized_modules now also walks quantized Embedding modules (previously is_quantlinear-only), so the NVFP4 packing + scale registration path in _export_quantized_weight runs for them on export.

hf_ptq.py example

  • For model_type == "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer and target the backbone embedding (*embeddings* / *embed_tokens*), overriding the default *lm_head* disable in _default_disabled_quantizer_cfg. Guarded helpers (_enable_lm_head_and_embedding_quantization, _extract_weight_quantizer_cfg) ensure the override only fires when a standard *weight_quantizer entry is present; the cfg manipulation is sketched just below.
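
A hedged sketch of that cfg manipulation. The wildcard keys follow the description above; whether NVFP4_W4A16_CFG is re-exported at the mtq package level, and the exact rule-dict layout, are assumptions:

import copy

import modelopt.torch.quantization as mtq

cfg = copy.deepcopy(mtq.NVFP4_W4A16_CFG)  # assumed "quant_cfg" rule-dict layout
weight_rule = cfg["quant_cfg"]["*weight_quantizer"]

# Appended after the default "*lm_head*": {"enable": False} disable; with
# last-matching-entry-wins resolution, these more specific keys prevail.
for pattern in ("*lm_head*", "*embeddings*", "*embed_tokens*"):
    cfg["quant_cfg"][pattern + "weight_quantizer"] = copy.deepcopy(weight_rule)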

example_utils.py — environment workarounds

These are idempotent workarounds for transformers 5.5.x's partial Nemotron-H port; they no-op on a fixed transformers release (e.g. the newer wheel inside the TRT Docker container):

  • NemotronHConfig._pattern_to_list: map "-" → "mlp"
  • ALLOWED_LAYER_TYPES: add "mlp"
  • NemotronHConfig.validate_layers_block_type: accept "mlp" (also update __class_validators__ since huggingface_hub's @strict_dataclass snapshots validators at class-creation time, so overwriting the method attribute alone isn't enough)
  • MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the layer_idx kwarg passed by NemotronHBlock
  • NemotronHBlock.__init__: alias block_type == "mlp" → "moe" so the inline block_type_to_mask lookup in NemotronHModel.forward resolves to None (dispatch is unaffected — the block's forward routes both through the same else branch that calls self.mixer(hidden_states))
  • generation_config: set do_sample=True when sampling hyperparams are set, so export's save_pretrained passes transformers 5.x strict validation
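
For the MIXER_TYPES["mlp"] entry above, the adapter is roughly this shape (hypothetical sketch; NemotronHMLP's exact __init__ signature on 5.5.x is the assumption being hedged):

def _mlp_mixer_adapter(config, layer_idx=None, **kwargs):
    # Swallow the layer_idx kwarg that NemotronHBlock passes to every mixer,
    # then delegate to the stock MLP implementation, which does not take it.
    from transformers.models.nemotron_h.modeling_nemotron_h import NemotronHMLP

    return NemotronHMLP(config, **kwargs)

# installed as: MIXER_TYPES["mlp"] = _mlp_mixer_adapter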

Validation

End-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:

python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
    --qformat nvfp4_w4a16 --kv_cache_qformat none \
    --trust_remote_code --dataset cnn_dailymail \
    --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
    --export_path /tmp/nemotron_3_nano_4b_nvfp4_w4a16

Produces a 2.2 GB unified HF checkpoint (vs 7.5 GB bf16), with model.embeddings.weight and lm_head.weight both stored as packed NVFP4 uint8 + FP8 per-block scales + FP32 global scale. hf_quant_config.json reports quant_algo: NVFP4_W4A16, group_size: 16, and exclude_modules contains only the 21 Mamba conv1d layers (the default _default_disabled_quantizer_cfg entry for *mixer.conv1d*).
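
A quick way to spot-check those artifacts from Python (hedged: the single-shard file name and the exact key layout of hf_quant_config.json are assumptions):

import json

from safetensors import safe_open

ckpt = "/tmp/nemotron_3_nano_4b_nvfp4_w4a16"
with safe_open(f"{ckpt}/model.safetensors", framework="pt") as f:
    w = f.get_tensor("lm_head.weight")
    print(w.dtype, w.shape)  # expect torch.uint8, two FP4 values packed per byte

with open(f"{ckpt}/hf_quant_config.json") as f:
    qcfg = json.load(f)
print(qcfg["quantization"]["quant_algo"])  # expect "NVFP4_W4A16"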

pre-commit run --files <staged> passes (ruff, ruff-format, mypy, bandit, insert-license, rst-lint).

Follow-ups (separate PRs)

  • Compressed-tensors conversion script for vLLM consumption: renames *.weight → *.weight_packed, *.weight_scale_2 → *.weight_global_scale (inverted), and rewrites the config.json quantization_config to format: nvfp4-pack-quantized / quant_method: compressed-tensors. Already prototyped out-of-tree; just needs cleanup + tests (a rename sketch follows this list).
  • Offline vLLM inference script for the converted checkpoint (CLI wrapping vllm.LLM with chat-template rendering, max_model_len cap, --enforce-eager default for Mamba/SSM). Already prototyped out-of-tree.
  • Nemotron-H config.json post-export cleanup: transformers 5.x strips hybrid_override_pattern in favor of the derived layers_block_type list, which breaks reload via the checkpoint's remote configuration_nemotron_h.py (its layers_block_type is a read-only @property). The export path should restore hybrid_override_pattern and set num_hidden_layers explicitly for model_type == "nemotron_h".
  • Optional --vllm-compat hf_ptq flag that additionally excludes Mamba in_proj (output dim 17504 = intermediate + conv_dim + num_heads isn't divisible by 64, violating vLLM's Marlin repack alignment) and leaves lm_head / model.embeddings in bf16 (vLLM's ParallelLMHead / VocabParallelEmbedding don't consume compressed-tensors scales), so the export is consumable by vLLM out of the box.
  • Upstream the transformers 5.5.x Nemotron-H fixes so the example_utils.py monkey-patches can be dropped.
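
The rename pass from the first follow-up, sketched under stated assumptions: the tensor-name suffixes and the scale inversion come from the bullet above, while the sibling-scale heuristic for detecting packed weights is made up for illustration.

from safetensors.torch import load_file, save_file

tensors = load_file("model.safetensors")
converted = {}
for name, t in tensors.items():
    if name.endswith(".weight_scale_2"):
        # compressed-tensors stores the inverted global scale
        converted[name.replace(".weight_scale_2", ".weight_global_scale")] = 1.0 / t
    elif name.endswith(".weight") and f"{name}_scale" in tensors:
        # heuristic: a sibling .weight_scale marks an NVFP4-packed weight
        converted[name.replace(".weight", ".weight_packed")] = t
    else:
        converted[name] = t
save_file(converted, "model.converted.safetensors")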

Test plan

  • pre-commit (ruff, ruff-format, mypy, bandit, insert-license) passes on the staged files.
  • Smoke test that nn.Embedding registers and is replaced with QuantEmbedding under mtq.quantize(..., NVFP4_W4A16_CFG, forward_loop=None) on a toy Sequential(Embedding, Linear) model; verified forward pass on CUDA (reconstructed in the sketch after this list).
  • End-to-end PTQ + unified HF export on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (see Validation above).
  • GPU unit/integration test under tests/gpu/torch/export/ for nn.Embedding weight packing — follow-up once the conversion/load path lands.
  • Multi-GPU / tensor-parallel export path — not exercised; Nemotron-H's accelerate-plus-multi-GPU path is already flagged as known-broken in hf_ptq.py, and this PR doesn't change that.
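
A minimal reconstruction of that smoke test (assumes NVFP4_W4A16_CFG is reachable from the mtq namespace and that the replacement class is named QuantEmbedding, per the PR description):

import torch
import torch.nn as nn

import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64)).cuda()
# Weight-only recipe: weights calibrate without a forward loop.
mtq.quantize(model, mtq.NVFP4_W4A16_CFG, forward_loop=None)
print(type(model[0]).__name__)  # expect the QuantEmbedding replacement class

out = model(torch.randint(0, 1000, (2, 8), device="cuda"))
assert out.shape == (2, 8, 64)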

Summary by CodeRabbit

  • New Features

    • NVFP4 W4A16 weight‑only quantization added (embeddings + lm_head support) and new PTQ recipe.
    • Embedding quantization module exported for use in the pipeline.
    • Added --exclude_modules CLI to selectively omit modules from quantization.
    • Export now temporarily normalizes generation settings during export and warns about nvfp4_w4a16 runtime compatibility.
  • Integration

    • Exporter, conversion, and example scripts updated to recognize nvfp4_w4a16 and pass exclusion args.
  • Tests

    • Added export/safetensors test case for nvfp4_w4a16.

…cipe

Extends ModelOpt PTQ so the input token embedding and output LM head can
participate in NVFP4 quantization, and wires that up for Nemotron-H where
those two 131072x3136 tables are ~21% of parameters and leaving them in
bf16 wastes most of the compression.

Changes
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new):
  register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that
  inherits QuantLinearConvBase but disables the input_quantizer by default
  (embedding inputs are integer indices, not activations).
- modelopt/torch/quantization/config.py: add NVFP4_DEFAULT_WEIGHT_ONLY_CFG
  (W4A16) via the existing _nvfp4_selective_quant_cfg(..., weight_only=True)
  helper; export via the `choices` set.
- modelopt/torch/export/unified_export_hf.py: _process_quantized_modules now
  also walks quantized Embedding modules (previously is_quantlinear-only),
  so the NVFP4 packing + scale registration path runs for them on export.
- examples/llm_ptq/hf_ptq.py: add `nvfp4_wo` qformat. For model_type ==
  "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer
  and target the backbone embedding (*embeddings* / *embed_tokens*),
  overriding the default *lm_head* disable in _default_disabled_quantizer_cfg.
- examples/llm_ptq/example_utils.py: environment workarounds so the example
  runs on transformers 5.5.x's partial Nemotron-H port (idempotent, no-op
  on fixed transformers):
    * NemotronHConfig._pattern_to_list: add `-` -> `mlp`
    * ALLOWED_LAYER_TYPES: add `"mlp"`
    * NemotronHConfig.validate_layers_block_type: accept `"mlp"` (also
      update __class_validators__ since @strict_dataclass snapshots it)
    * MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the
      layer_idx kwarg passed by NemotronHBlock
    * NemotronHBlock.__init__: alias block_type=="mlp" -> "moe" so the
      inline block_type_to_mask lookup in NemotronHModel.forward resolves
      to None (dispatch is unaffected — the block's forward routes both
      through the same `else` branch)
    * generation_config: set do_sample=True when sampling hyperparams are
      set, so export's save_pretrained passes 5.x strict validation

Validated end-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:
  python examples/llm_ptq/hf_ptq.py \
      --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
      --qformat nvfp4_wo --kv_cache_qformat none \
      --trust_remote_code --dataset cnn_dailymail \
      --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
      --export_path /tmp/nemotron_3_nano_4b_nvfp4_wo
Produces a 2.1 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings and lm_head both exported as packed NVFP4 uint8 +
FP8 per-block scales + FP32 global scales.

Follow-ups (separate PRs):
- compressed-tensors conversion script for vLLM consumption (rename
  weight -> weight_packed, weight_scale_2 -> weight_global_scale, rewrite
  config.json quantization_config to format=nvfp4-pack-quantized).
- offline vLLM inference script for the converted checkpoint.
- Nemotron-H config.json post-export cleanup (transformers 5.x strips
  hybrid_override_pattern in favor of the derived layers_block_type,
  which breaks reload via the checkpoint's remote configuration_nemotron_h.py
  because layers_block_type there is a read-only property).
- optional --vllm-compat hf_ptq flag that also excludes Mamba in_proj
  (output dim 17504 not divisible by 64, violating vLLM's Marlin repack
  alignment) so the export is consumable by vLLM out of the box.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane requested review from a team as code owners April 22, 2026 22:33
@ajrasane ajrasane requested review from meenchen and realAsma April 22, 2026 22:33

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds NVFP4 W4A16 weight‑only quantization end‑to‑end (config, selection, exporter packing, tests), registers nn.Embedding quantization and exports embedding weights, adds a generation_config normalization export context, and exposes --exclude_modules in the PTQ CLI and example script.

Changes

  • Export-time generation_config handling (examples/llm_ptq/example_utils.py): Adds the normalized_generation_config_for_export(model) context manager and the required contextmanager import; temporarily normalizes model.generation_config during export and restores it afterwards.
  • PTQ CLI & example workflow (examples/llm_ptq/hf_ptq.py, examples/llm_ptq/scripts/huggingface_example.sh): Adds the nvfp4_w4a16 qformat; introduces the --exclude_modules CLI flag (and script propagation); applies exclude patterns to the quant config by deep-copying and adding disable rules; wraps export with generation-config normalization and logs/prints export warnings for nvfp4_w4a16.
  • Exporter identifiers & mapping (modelopt/torch/export/model_config.py, modelopt/torch/export/convert_hf_config.py): Adds the QUANTIZATION_NVFP4_W4A16 constant; maps it to a weights-only FP4 compressed-tensors group config and targets Linear and Embedding in the converted HF quant config.
  • Unified HF exporter weight flow (modelopt/torch/export/unified_export_hf.py): Recognizes QUANTIZATION_NVFP4_W4A16 in quantized-weight export/packing and includes embedding submodules in the quantized-weight export path under fsdp2-aware updates.
  • Quantization core, NVFP4 weight-only (modelopt/torch/quantization/config.py, modelopt/torch/export/quant_utils.py): Adds the weight-only NVFP4 config constant NVFP4_W4A16_CFG to choices; wires nvfp4_w4a16 into detection, scaling/packing, TRT export config generation, and quantized-weight conversion (treats an absent input_quantizer as weight-only).
  • Embedding quantization module & package export (modelopt/torch/quantization/nn/modules/quant_embedding.py, modelopt/torch/quantization/nn/__init__.py): Adds the registered _QuantEmbedding/QuantEmbedding class (per-row linear weight quantization, disables input fake-quant for indices) and re-exports embedding quantization symbols at package level.
  • TF-safe quantized export behavior (modelopt/torch/export/quant_utils.py, modelopt/torch/export/unified_export_hf.py): Extends NVFP4 weight packing and weight-scale registration/logic to handle NVFP4 W4A16 alongside other NVFP4 variants.
  • Tests (tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py): Adds an nvfp4_w4a16 case to the unified HF export + safetensors verification matrix (sanity/presence checks with fuse/scale flags off).
  • Recipe (modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml): Adds a PTQ recipe enabling NVFP4 W4A16 weight-only quantization for Nemotron-H with selective exclusions and explicit enables for embeddings and lm_head.
  • Changelog (CHANGELOG.rst): Documents NVFP4 W4A16 weight-only quantization, embedding export support, and the new --exclude_modules CLI hook.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Script
    participant HF_PTQ
    participant Quant_Config
    participant Exporter
    participant Filesystem

    User->>Script: invoke (QFORMAT=nvfp4_w4a16, EXCLUDE_MODULES)
    Script->>HF_PTQ: run hf_ptq with args
    HF_PTQ->>Quant_Config: build/modify quant config (NVFP4_W4A16_CFG, exclusions)
    HF_PTQ->>HF_PTQ: apply normalized_generation_config_for_export()
    HF_PTQ->>Exporter: export_quantized()
    Exporter->>Filesystem: write quantized safetensors (include embeddings, NVFP4 packing)
    Exporter->>User: emit export path & runtime warnings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 68.42%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately captures the main changes: adding the NVFP4 W4A16 recipe and enabling quantization of lm_head and embedding layers for Nemotron-H.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Security Anti-Patterns: ✅ Passed. A comprehensive security scan found no instances of torch.load with weights_only=False, numpy.load with allow_pickle=True, hardcoded trust_remote_code=True, eval/exec on untrusted input, nosec bypass comments, or non-permissive dependencies.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ajrasane ajrasane changed the title feat: quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe Apr 22, 2026

github-actions Bot commented Apr 22, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1327/

Built to branch gh-pages at 2026-05-01 18:54 UTC.
Preview will be ready when the GitHub Pages deployment is complete.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)

102-123: ⚠️ Potential issue | 🟠 Major

nvfp4_wo is still rejected by auto_quantize().

The new choice is exposed here, but the hard-coded qformat allowlist in auto_quantize() (Lines 325-344) was not updated. --auto_quantize_bits --qformat nvfp4_wo,... now fails the assertion even though the format is advertised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 102 - 123, The qformat "nvfp4_wo"
was added to QUANT_CFG_CHOICES but not included in the hard-coded qformat
allowlist inside auto_quantize(), causing the assertion failure; update the
allowlist in the auto_quantize() function to include "nvfp4_wo" (or extend the
allowlist to derive keys from QUANT_CFG_CHOICES) so that --auto_quantize_bits
--qformat nvfp4_wo is accepted; search for the auto_quantize function and add
"nvfp4_wo" to the qformat/allowlist there (or replace the static list with
QUANT_CFG_CHOICES.keys()) to keep them in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 853-867: The current code mutates the live model.generation_config
(gen_cfg) which makes the same model instance used by get_model()
non-deterministic; instead, create a copy of the generation_config (e.g., via
copy.deepcopy or by constructing a new GenerationConfig from the dict) and
modify the copy’s do_sample flag, leaving model.generation_config unchanged;
update the export/normalization logic around gen_cfg to use this gen_cfg_copy
(or a temporary variable) so previews/full_model.generate() remain deterministic
and only the exported metadata contains the normalized setting.

In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only appends a "*lm_head*weight_quantizer" override which causes
activation-quantized recipes (e.g., fp8/nvfp4 applied by mono_quantize) to
become mixed-format at lm_head; update this function so it either (A) checks the
applied recipe/weight_quantizer_cfg and only appends the lm_head weight override
for weight-only formats, or (B) when adding "*lm_head*weight_quantizer" also
append a corresponding "*lm_head*input_quantizer" entry that mirrors the base
input-quantizer entry (use copy.deepcopy of the existing input_quantizer config)
so lm_head keeps the same activation format as the rest of the model; reference
_enable_lm_head_and_embedding_quantization, the "quant_cfg" list entries, and
mono_quantize when implementing the conditional or mirrored input_quantizer
addition.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a191d33d-6d3e-4cf2-abd0-8b9541d5908e

📥 Commits

Reviewing files that changed from the base of the PR and between e56682e and 43c3454.

📒 Files selected for processing (6)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/nn/__init__.py
  • modelopt/torch/quantization/nn/modules/quant_embedding.py

Comment thread examples/llm_ptq/example_utils.py Outdated
Comment on lines +853 to +867
# Some model cards ship a generation_config.json that sets sampling hyperparameters
# (top_p, temperature) without ``do_sample=True`` (e.g. NVIDIA-Nemotron-3-Nano-4B-BF16).
# transformers 5.x strictly validates this on save_pretrained, so the export step
# fails with "GenerationConfig is invalid". Normalize by enabling do_sample whenever
# a sampling hyperparameter is set — this is only metadata, not behavior during
# calibration or export.
gen_cfg = getattr(model, "generation_config", None)
if gen_cfg is not None and not getattr(gen_cfg, "do_sample", False):
    has_sampling_hyperparam = (
        getattr(gen_cfg, "top_p", None) not in (None, 1.0)
        or getattr(gen_cfg, "top_k", None) not in (None, 0, 50)
        or getattr(gen_cfg, "temperature", None) not in (None, 1.0)
    )
    if has_sampling_hyperparam:
        gen_cfg.do_sample = True


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cd examples/llm_ptq && head -n 1 example_utils.py && wc -l example_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 186


🏁 Script executed:

cd examples/llm_ptq && sed -n '830,880p' example_utils.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2298


🏁 Script executed:

cd examples/llm_ptq && grep -n "def get_model" example_utils.py | head -5

Repository: NVIDIA/Model-Optimizer

Length of output: 84


🏁 Script executed:

cd examples/llm_ptq && rg "get_model\(" hf_ptq.py -A 5 -B 2

Repository: NVIDIA/Model-Optimizer

Length of output: 430


🏁 Script executed:

cd examples/llm_ptq && rg "\.generate\(" hf_ptq.py -B 3 -A 3

Repository: NVIDIA/Model-Optimizer

Length of output: 1010


🏁 Script executed:

cd examples/llm_ptq && sed -n '1,50p' hf_ptq.py | head -30

Repository: NVIDIA/Model-Optimizer

Length of output: 1060


🏁 Script executed:

cd examples/llm_ptq && grep -n "full_model" hf_ptq.py | head -20

Repository: NVIDIA/Model-Optimizer

Length of output: 1080


🏁 Script executed:

cd examples/llm_ptq && sed -n '300,350p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1721


🏁 Script executed:

cd examples/llm_ptq && sed -n '334,345p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 465


🏁 Script executed:

cd examples/llm_ptq && sed -n '350,365p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 645


🏁 Script executed:

cd examples/llm_ptq && grep -n "generated_ids_before_ptq" hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 816


🏁 Script executed:

cd examples/llm_ptq && sed -n '257,270p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 653


🏁 Script executed:

cd examples/llm_ptq && sed -n '580,600p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 689


🏁 Script executed:

cd examples/llm_ptq && sed -n '880,925p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1892


🏁 Script executed:

cd examples/llm_ptq && sed -n '1100,1125p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1108


🏁 Script executed:

cd examples/llm_ptq && sed -n '970,1020p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2331


🏁 Script executed:

cd examples/llm_ptq && sed -n '1180,1220p' hf_ptq.py

Repository: NVIDIA/Model-Optimizer

Length of output: 1472


Don't mutate the live generation_config in get_model().

The mutation persists on the returned model object, and both the before-PTQ and after-PTQ preview calls (full_model.generate() at lines 922 and 980 in hf_ptq.py) use that same model instance. For checkpoints with sampling hyperparameters, this makes the previews non-deterministic instead of deterministic, undermining PTQ smoke test comparisons. Normalize a copy during export instead.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 853 - 867, The current code
mutates the live model.generation_config (gen_cfg) which makes the same model
instance used by get_model() non-deterministic; instead, create a copy of the
generation_config (e.g., via copy.deepcopy or by constructing a new
GenerationConfig from the dict) and modify the copy’s do_sample flag, leaving
model.generation_config unchanged; update the export/normalization logic around
gen_cfg to use this gen_cfg_copy (or a temporary variable) so
previews/full_model.generate() remain deterministic and only the exported
metadata contains the normalized setting.
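
For reference, a minimal sketch of the copy-then-swap fix being asked for here. The actual implementation landed later as normalized_generation_config_for_export (per the walkthrough), so this shape is a guess built from the quoted hunk:

import copy
from contextlib import contextmanager

@contextmanager
def normalized_generation_config_for_export(model):
    original = getattr(model, "generation_config", None)
    if original is None:
        yield model
        return
    normalized = copy.deepcopy(original)
    if not getattr(normalized, "do_sample", False) and (
        getattr(normalized, "top_p", None) not in (None, 1.0)
        or getattr(normalized, "top_k", None) not in (None, 0, 50)
        or getattr(normalized, "temperature", None) not in (None, 1.0)
    ):
        normalized.do_sample = True
    model.generation_config = normalized  # exported metadata sees the copy
    try:
        yield model
    finally:
        model.generation_config = original  # previews keep deterministic generate()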

Comment thread examples/llm_ptq/hf_ptq.py Outdated
…mat ID

Integrates the scaffolding from PR #1313 (NVFP4 W4A16 generic support) into
this branch so the two PRs don't diverge on naming or export-path coverage.
The embedding + Nemotron-H enablement in the previous commit is unchanged;
this commit just adopts #1313's conventions for the pieces that overlap.

Changes
- modelopt/torch/quantization/config.py: rename the W4A16 recipe constant
  NVFP4_DEFAULT_WEIGHT_ONLY_CFG -> NVFP4_W4A16_CFG to match #1313.
- modelopt/torch/export/model_config.py: add QUANTIZATION_NVFP4_W4A16 as a
  distinct format ID instead of relying on NVFP4 branches tolerating a
  disabled input_quantizer.
- modelopt/torch/export/quant_utils.py: thread NVFP4_W4A16 through
  get_weight_scaling_factor, get_weight_scaling_factor_2,
  to_quantized_weight, and the nvfp4_w4a16 branch of
  process_layer_quant_config. Add explicit W4A16 detection in
  _get_quantization_from_layer when input_quantizer is absent/disabled.
- modelopt/torch/export/unified_export_hf.py: add NVFP4_W4A16 to the
  weight_scale_2 registration and NVFP4 transpose lists.
- modelopt/torch/export/convert_hf_config.py: add NVFP4_W4A16 mapping in
  _quant_algo_to_group_config and convert_hf_quant_config_format so the
  llm-compressor conversion emits a weight-only config group.
- examples/llm_ptq/hf_ptq.py: rename qformat nvfp4_wo -> nvfp4_w4a16; add
  --exclude_modules CLI (composes with the Nemotron-H helpers added in
  the previous commit); emit a post-export vLLM deployment warning.
- examples/llm_ptq/scripts/huggingface_example.sh: add nvfp4_w4a16 to the
  qformat allowlist, EXCLUDE_MODULES env pass-through, and a W4A16 export
  notice.
- CHANGELOG.rst: document W4A16 (covers the overlap with #1313) and the
  Embedding / Nemotron-H enablement unique to this PR.
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py:
  add an nvfp4_w4a16 parametrize entry for the tiny_llama fixture.

pre-commit (ruff, ruff-format, mypy, bandit, insert-license, rst-lint)
passes on all touched files.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
examples/llm_ptq/hf_ptq.py (1)

597-637: ⚠️ Potential issue | 🟠 Major

Keep lm_head quantization aligned with the base Nemotron-H recipe.

This helper only re-enables *lm_head*weight_quantizer. When Nemotron-H runs with an activation-aware recipe such as fp8 or nvfp4, lm_head stops matching the rest of the model; for NVFP4, modelopt/torch/export/quant_utils.py now even reclassifies it as nvfp4_w4a16 because the input quantizer stays disabled. Either gate this helper to weight-only recipes or append a mirrored *lm_head*input_quantizer rule copied from the base config.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/hf_ptq.py` around lines 597 - 637, The helper
_enable_lm_head_and_embedding_quantization currently only re-enables the lm_head
weight quantizer which desynchronizes lm_head when activation-aware recipes
(e.g. fp8/nvfp4) are used; update it to either (A) only run when the active
recipe is weight-only (check quant_cfg["algorithm"] or similar indicator) OR (B)
also append a mirrored "*lm_head*input_quantizer" entry copied from the
base/input quantizer config so lm_head keeps the same input quantization as the
rest of the model; modify _enable_lm_head_and_embedding_quantization to perform
one of these two fixes and ensure the new entry uses copy.deepcopy like the
existing weight entries.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 686-697: The Nemotron-H opt-in currently adds enable/config
entries after quantize_main() has already appended user --exclude_modules rules
(so it can override user exclusions); update the flow so user exclusions are
respected by either moving the
_enable_lm_head_and_embedding_quantization(quant_cfg, weight_quantizer_cfg) call
to run before quantize_main()/before mono_quantize() applies exclude updates, or
(preferably) change _enable_lm_head_and_embedding_quantization to check
quant_cfg.exclude_modules (and any existing disable rules) and skip adding
enable/config entries for "lm_head" or "embeddings" if the user explicitly
excluded them; make this change around quantize_main(), mono_quantize(), and
_enable_lm_head_and_embedding_quantization so the user's --exclude_modules is
never silently undone.

In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 130-132: The wrapper currently expands shell globs because
EXCLUDE_MODULES is injected unquoted into PTQ_ARGS before calling hf_ptq.py; fix
this by preserving the literal pattern when forwarding --exclude_modules: stop
building a single unquoted string and use a bash array for PTQ_ARGS (e.g.,
append the two separate elements "--exclude_modules" and "$EXCLUDE_MODULES") or
otherwise ensure the variable is quoted when added so hf_ptq.py receives the
exact pattern (references: EXCLUDE_MODULES, PTQ_ARGS, and the --exclude_modules
argument passed to hf_ptq.py).

In `@modelopt/torch/export/convert_hf_config.py`:
- Around line 191-198: The NVFP4_W4A16 branch sets config_group_details.targets
to only ["Linear"], which omits embeddings even though weight-only quantization
applies to Embedding layers; update the targets list in the NVFP4_W4A16 branch
(where quant_algo_value == "NVFP4_W4A16" and config_group_details is built) to
include "Embedding" (e.g., ["Linear", "Embedding"]) before assigning
new_config["config_groups"] so compressed-tensors exports match the actual
NVFP4_W4A16 quantization coverage.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7568812b-30c3-4c61-bbc4-d04ef8fc9364

📥 Commits

Reviewing files that changed from the base of the PR and between 43c3454 and 490b6b2.

📒 Files selected for processing (9)
  • CHANGELOG.rst
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/convert_hf_config.py
  • modelopt/torch/export/model_config.py
  • modelopt/torch/export/quant_utils.py
  • modelopt/torch/export/unified_export_hf.py
  • modelopt/torch/quantization/config.py
  • tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
  • modelopt/torch/export/model_config.py
  • CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/torch/export/unified_export_hf.py

Comment thread examples/llm_ptq/hf_ptq.py Outdated
Comment thread examples/llm_ptq/scripts/huggingface_example.sh Outdated
Comment thread modelopt/torch/export/convert_hf_config.py
@ajrasane ajrasane self-assigned this Apr 23, 2026
- example_utils: swap in a normalized generation_config via context manager
  during export instead of mutating the live one in get_model() — preview
  generate() calls stay deterministic
- hf_ptq: mirror *lm_head*input_quantizer for activation-aware recipes so
  lm_head doesn't silently downgrade to W4A16 under NVFP4/FP8
- hf_ptq: respect --exclude_modules in the Nemotron-H lm_head/embedding
  override so user exclusions aren't silently undone
- hf_ptq: add nvfp4_w4a16 to the auto_quantize qformat allowlist for
  consistency with QUANT_CFG_CHOICES
- huggingface_example.sh: pass --exclude_modules via a bash array (set -f)
  so wildcard patterns like *embed_tokens* reach argparse verbatim instead
  of being glob-expanded against the filesystem
- convert_hf_config: include Embedding in the NVFP4_W4A16 target set so
  compressed-tensors consumers dispatch on quantized embedding weights

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/llm_ptq/example_utils.py (1)

873-876: Consider documenting the top_k=50 check.

The check getattr(original, "top_k", None) not in (None, 0, 50) treats top_k=50 as a "default/unset" value. While 50 is indeed the transformers default, this implicit knowledge could benefit from a brief inline comment for maintainability.

💡 Suggested clarification
         has_sampling_hyperparam = (
             getattr(original, "top_p", None) not in (None, 1.0)
-            or getattr(original, "top_k", None) not in (None, 0, 50)
+            or getattr(original, "top_k", None) not in (None, 0, 50)  # 50 is transformers default
             or getattr(original, "temperature", None) not in (None, 1.0)
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/llm_ptq/example_utils.py` around lines 873 - 876, The condition
using getattr(original, "top_k", None) not in (None, 0, 50) implicitly treats
top_k==50 as a default/unset value; add a brief inline comment next to that
expression (or expand the surrounding docstring) stating that 50 is the
HuggingFace/transformers default so it should be treated as unset, e.g., “# 50
is transformers' default top_k, treat as unset”; ensure the comment references
getattr(original, "top_k", None) so future readers understand why 50 is
excluded.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0eeb4272-c356-455a-9a92-0ae3a6fbd489

📥 Commits

Reviewing files that changed from the base of the PR and between 490b6b2 and a115c88.

📒 Files selected for processing (4)
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • examples/llm_ptq/scripts/huggingface_example.sh
  • modelopt/torch/export/convert_hf_config.py

Comment thread examples/llm_ptq/hf_ptq.py Outdated
mts.export(full_model)


def _enable_lm_head_and_embedding_quantization(
Contributor


Can we define this in a modelopt recipe, if everything under modelopt_recipes/models can be captured with our YAML recipe system?

Comment thread examples/llm_ptq/hf_ptq.py Outdated
Comment on lines +728 to +754
# For Nemotron-H (Mamba-2 + MLP + Attention hybrid, e.g. NVIDIA-Nemotron-3-Nano-4B),
# extend quantization coverage to the lm_head and the input token embedding. On this
# architecture those two 131072x3136 tables account for ~21% of parameters, so leaving
# them at bf16 wastes most of the NVFP4 memory benefit.
if model_type == "nemotron_h":
    weight_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "weight_quantizer")
    if weight_quantizer_cfg is not None:
        # ``input_quantizer_cfg`` is present only for activation-aware recipes (fp8, nvfp4,
        # ...). For weight-only recipes (nvfp4_w4a16, fp8_pb_wo, ...) this returns None and
        # ``lm_head`` stays weight-only along with the embedding.
        input_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "input_quantizer")
        print(
            "Nemotron-H detected: extending quantization to lm_head and input embedding "
            "(backbone.embeddings)."
        )
        _enable_lm_head_and_embedding_quantization(
            quant_cfg,
            weight_quantizer_cfg,
            input_quantizer_cfg=input_quantizer_cfg,
            user_excluded_modules=args.exclude_modules or None,
        )
    else:
        warnings.warn(
            "Nemotron-H detected but quant_cfg has no wildcard '*weight_quantizer' entry; "
            "skipping lm_head/embedding extension (model-specific or non-standard recipe)."
        )

Contributor


Same as the previous comment, wondering if our recipe system can replace this ad hoc change to support specific models

if args.qformat == "nvfp4_w4a16":
    warnings.warn(
        "TensorRT-LLM and SGLang do not support this format. "
        "To serve on vLLM, convert the NVFP4 W4A16 checkpoint to compressed-tensors format."


hi @ajrasane, should we point users to how they can convert? Do we have a helper in ModelOpt we should point them to?

Contributor Author


@hychiang-git, are you planning to merge your conversion script to modelopt?

…ML recipe

Move the Nemotron-H-specific quantization extensions out of `hf_ptq.py` and
into a declarative recipe at `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`,
addressing PR #1327 review feedback. The recipe captures exactly what the
removed `_enable_lm_head_and_embedding_quantization` helper did:

* All Linear weight quantizers ON (NVFP4 W4A16, group_size 16, scale_bits e4m3).
* Standard `_default_disabled_quantizer_cfg` exclusions (BatchNorm, conv1d, etc.).
* `*lm_head*weight_quantizer`, `*embeddings*weight_quantizer`, and
  `*embed_tokens*weight_quantizer` re-enabled AFTER the default disables so
  they take precedence (last matching entry wins).

Drop the helpers (`_enable_lm_head_and_embedding_quantization`,
`_extract_wildcard_quantizer_cfg`) and the `if model_type == "nemotron_h":`
block in `mono_quantize`. Users now opt in explicitly via
`--recipe models/Nemotron-H/nvfp4_w4a16` instead of relying on auto-detection.

Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (RTX 6000 Ada,
calib_size=16, calib_seq=256): 94 weight quantizers enabled and 21 disabled
(the Mamba `*mixer.conv1d*` layers), `lm_head.weight_quantizer` and
`model.embeddings.weight_quantizer` carry NVFP4 cfg, exported safetensors is
2.13 GiB (matches prior PR-validation export size), and `hf_quant_config.json`
reports `quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d
layers]`.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane requested a review from a team as a code owner May 1, 2026 18:23
ajrasane added 2 commits May 1, 2026 18:42
`_maybe_patch_transformers_nemotron_h_mixer_types()` papered over three gaps
in transformers 5.5.x's built-in Nemotron-H port:

1. `NemotronHConfig._pattern_to_list` not mapping `-` → mlp
2. `MIXER_TYPES` not registering `mlp`
3. `ALLOWED_LAYER_TYPES` not including `mlp`

modelopt's supported transformers floor is 4.56 (per `pyproject.toml` and
`modelopt/torch/__init__.py`). On 4.56.x the upstream Nemotron-H module
doesn't exist at all (`transformers.models.nemotron_h` is missing) — so the
patch's three branches all hit `ImportError` and silently no-op. With
`--trust_remote_code`, the Nemotron-H checkpoint's bundled
`configuration_nemotron_h.py` / `modeling_nemotron_h.py` carry their own
`MIXER_TYPES`, `_pattern_to_list`, and `validate_layers_block_type`, so the
ad-hoc transformers patches were never load-bearing for that codepath either.

Drop the helper and the `_maybe_patch_transformers_nemotron_h_mixer_types()`
call in `get_model()`. Smoke-tested on `transformers 4.56.2` with
`AutoConfig.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
trust_remote_code=True)` — config and model load cleanly without the patch.

Users running on transformers 5.5.x (modelopt's experimental band) who
attempt to load Nemotron-H without `--trust_remote_code` will hit the
upstream gap directly; the recommended path remains
`--trust_remote_code` + the model's bundled remote code.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
The `--exclude_modules` flag was added in this PR as an escape hatch for
overriding the auto-applied lm_head/embedding inclusion on Nemotron-H. Now
that meenchen's recipe-system review is addressed and the Nemotron-H
extensions live in `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`,
this flag has no remaining purpose: users who want different exclusions
write a different recipe.

Removes:
* the `--exclude_modules` argparse entry in `hf_ptq.py`
* the `args.exclude_modules` apply-loop in `quantize_main()`
* the `EXCLUDE_MODULES` env-var passthrough + `EXCLUDE_MODULES_ARGS` bash
  array in `examples/llm_ptq/scripts/huggingface_example.sh`

Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` with
`--recipe models/Nemotron-H/nvfp4_w4a16` (transformers 4.56.2, GPU 5,
calib_size=16): same coverage as before — 94 weight quantizers enabled,
21 disabled (the Mamba `*mixer.conv1d*` layers); `lm_head.weight_quantizer`
and `backbone.embeddings.weight_quantizer` carry NVFP4 W4A16 cfg;
exported safetensors 2.1 GiB; `hf_quant_config.json` reports
`quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d
layers]`. The recipe still dictates the exclusion set, so behavior is
unchanged for the supported codepath.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>