Quantize lm_head + embedding for Nemotron-H, add NVFP4 W4A16 recipe#1327
Conversation
…cipe
Extends ModelOpt PTQ so the input token embedding and output LM head can
participate in NVFP4 quantization, and wires that up for Nemotron-H where
those two 131072x3136 tables are ~21% of parameters and leaving them in
bf16 wastes most of the compression.
Changes
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new):
register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that
inherits QuantLinearConvBase but disables the input_quantizer by default
(embedding inputs are integer indices, not activations); a hypothetical registration sketch follows the Changes list below.
- modelopt/torch/quantization/config.py: add NVFP4_DEFAULT_WEIGHT_ONLY_CFG
(W4A16) via the existing _nvfp4_selective_quant_cfg(..., weight_only=True)
helper; export via the `choices` set.
- modelopt/torch/export/unified_export_hf.py: _process_quantized_modules now
also walks quantized Embedding modules (previously is_quantlinear-only),
so the NVFP4 packing + scale registration path runs for them on export.
- examples/llm_ptq/hf_ptq.py: add `nvfp4_wo` qformat. For model_type ==
"nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer
and target the backbone embedding (*embeddings* / *embed_tokens*),
overriding the default *lm_head* disable in _default_disabled_quantizer_cfg.
- examples/llm_ptq/example_utils.py: environment workarounds so the example
runs on transformers 5.5.x's partial Nemotron-H port (idempotent, no-op
on fixed transformers):
* NemotronHConfig._pattern_to_list: add `-` -> `mlp`
* ALLOWED_LAYER_TYPES: add `"mlp"`
* NemotronHConfig.validate_layers_block_type: accept `"mlp"` (also
update __class_validators__ since @strict_dataclass snapshots it)
* MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the
layer_idx kwarg passed by NemotronHBlock
* NemotronHBlock.__init__: alias block_type=="mlp" -> "moe" so the
inline block_type_to_mask lookup in NemotronHModel.forward resolves
to None (dispatch is unaffected — the block's forward routes both
through the same `else` branch)
* generation_config: set do_sample=True when sampling hyperparams are
set, so export's save_pretrained passes 5.x strict validation
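As a reference for the quant_embedding.py item above, here is a minimal sketch of how such a registration could look. The registry decorator signature, the base-class import path, and the _setup/disable hooks are assumptions modeled on how ModelOpt registers QuantLinear; this is not the PR's verbatim code.
# Hypothetical sketch only: import paths and registry/base-class APIs are assumed
# from the existing QuantLinear pattern, not copied from this PR.
import torch.nn as nn

from modelopt.torch.quantization.nn import QuantModuleRegistry
from modelopt.torch.quantization.nn.modules.quant_module import QuantLinearConvBase  # path is an assumption


@QuantModuleRegistry.register({nn.Embedding: "nn.Embedding"})
class _QuantEmbedding(QuantLinearConvBase):
    """Weight-only quantized embedding: inputs are token ids, so only the weight is quantized."""

    def _setup(self):
        super()._setup()
        # Embedding inputs are integer indices, not activations, so the
        # input quantizer is disabled by default.
        self.input_quantizer.disable()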
Validated end-to-end on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:
python examples/llm_ptq/hf_ptq.py \
--pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \
--qformat nvfp4_wo --kv_cache_qformat none \
--trust_remote_code --dataset cnn_dailymail \
--calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \
--export_path /tmp/nemotron_3_nano_4b_nvfp4_wo
Produces a 2.1 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings and lm_head both exported as packed NVFP4 uint8 +
FP8 per-block scales + FP32 global scales.
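For a quick sanity check of the exported checkpoint, something like the snippet below dumps the dtypes and shapes of the embedding/lm_head tensors. The shard filename and the exact tensor names are assumptions about the unified export layout, not guaranteed by this PR.
# Sketch: inspect the exported unified HF checkpoint. The shard filename and tensor
# names are assumptions about the export layout; adjust to what the export wrote.
from safetensors import safe_open

ckpt = "/tmp/nemotron_3_nano_4b_nvfp4_wo/model.safetensors"  # hypothetical single-shard file
with safe_open(ckpt, framework="pt") as f:
    for name in f.keys():
        if "lm_head" in name or "embeddings" in name:
            t = f.get_tensor(name)
            # Packed NVFP4 weights should show up as uint8 with the last dim halved;
            # per-block scales as float8_e4m3fn; global scales as float32 scalars.
            print(f"{name}: dtype={t.dtype}, shape={tuple(t.shape)}")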
Follow-ups (separate PRs):
- compressed-tensors conversion script for vLLM consumption (rename
weight -> weight_packed, weight_scale_2 -> weight_global_scale, rewrite
config.json quantization_config to format=nvfp4-pack-quantized).
- offline vLLM inference script for the converted checkpoint.
- Nemotron-H config.json post-export cleanup (transformers 5.x strips
hybrid_override_pattern in favor of the derived layers_block_type,
which breaks reload via the checkpoint's remote configuration_nemotron_h.py
because layers_block_type there is a read-only property).
- optional --vllm-compat hf_ptq flag that also excludes Mamba in_proj
(output dim 17504 not divisible by 64, violating vLLM's Marlin repack
alignment) so the export is consumable by vLLM out of the box.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
📝 Walkthrough
Adds NVFP4 W4A16 weight-only quantization end-to-end (config, selection, exporter packing, tests), registers nn.Embedding quantization and exports embedding weights, adds a generation_config normalization export context, and exposes --exclude_modules in the PTQ CLI and example script.
Sequence Diagram(s)
sequenceDiagram
participant User
participant Script
participant HF_PTQ
participant Quant_Config
participant Exporter
participant Filesystem
User->>Script: invoke (QFORMAT=nvfp4_w4a16, EXCLUDE_MODULES)
Script->>HF_PTQ: run hf_ptq with args
HF_PTQ->>Quant_Config: build/modify quant config (NVFP4_W4A16_CFG, exclusions)
HF_PTQ->>HF_PTQ: apply normalized_generation_config_for_export()
HF_PTQ->>Exporter: export_quantized()
Exporter->>Filesystem: write quantized safetensors (include embeddings, NVFP4 packing)
Exporter->>User: emit export path & runtime warnings
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ❌ 1 failed (1 warning) | ✅ 5 passed
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)
102-123: ⚠️ Potential issue | 🟠 Major
nvfp4_wo is still rejected by auto_quantize(). The new choice is exposed here, but the hard-coded qformat allowlist in auto_quantize() (lines 325-344) was not updated. --auto_quantize_bits --qformat nvfp4_wo,... now fails the assertion even though the format is advertised.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 102 - 123, The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but not included in the hard-coded qformat allowlist inside auto_quantize(), causing the assertion failure; update the allowlist in the auto_quantize() function to include "nvfp4_wo" (or extend the allowlist to derive keys from QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted; search for the auto_quantize function and add "nvfp4_wo" to the qformat/allowlist there (or replace the static list with QUANT_CFG_CHOICES.keys()) to keep them in sync.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 853-867: The current code mutates the live model.generation_config
(gen_cfg) which makes the same model instance used by get_model()
non-deterministic; instead, create a copy of the generation_config (e.g., via
copy.deepcopy or by constructing a new GenerationConfig from the dict) and
modify the copy’s do_sample flag, leaving model.generation_config unchanged;
update the export/normalization logic around gen_cfg to use this gen_cfg_copy
(or a temporary variable) so previews/full_model.generate() remain deterministic
and only the exported metadata contains the normalized setting.
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only appends a "*lm_head*weight_quantizer" override which causes
activation-quantized recipes (e.g., fp8/nvfp4 applied by mono_quantize) to
become mixed-format at lm_head; update this function so it either (A) checks the
applied recipe/weight_quantizer_cfg and only appends the lm_head weight override
for weight-only formats, or (B) when adding "*lm_head*weight_quantizer" also
append a corresponding "*lm_head*input_quantizer" entry that mirrors the base
input-quantizer entry (use copy.deepcopy of the existing input_quantizer config)
so lm_head keeps the same activation format as the rest of the model; reference
_enable_lm_head_and_embedding_quantization, the "quant_cfg" list entries, and
mono_quantize when implementing the conditional or mirrored input_quantizer
addition.
---
Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 102-123: The qformat "nvfp4_wo" was added to QUANT_CFG_CHOICES but
not included in the hard-coded qformat allowlist inside auto_quantize(), causing
the assertion failure; update the allowlist in the auto_quantize() function to
include "nvfp4_wo" (or extend the allowlist to derive keys from
QUANT_CFG_CHOICES) so that --auto_quantize_bits --qformat nvfp4_wo is accepted;
search for the auto_quantize function and add "nvfp4_wo" to the
qformat/allowlist there (or replace the static list with
QUANT_CFG_CHOICES.keys()) to keep them in sync.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: a191d33d-6d3e-4cf2-abd0-8b9541d5908e
📒 Files selected for processing (6)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/nn/__init__.py
- modelopt/torch/quantization/nn/modules/quant_embedding.py
# Some model cards ship a generation_config.json that sets sampling hyperparameters
# (top_p, temperature) without ``do_sample=True`` (e.g. NVIDIA-Nemotron-3-Nano-4B-BF16).
# transformers 5.x strictly validates this on save_pretrained, so the export step
# fails with "GenerationConfig is invalid". Normalize by enabling do_sample whenever
# a sampling hyperparameter is set — this is only metadata, not behavior during
# calibration or export.
gen_cfg = getattr(model, "generation_config", None)
if gen_cfg is not None and not getattr(gen_cfg, "do_sample", False):
    has_sampling_hyperparam = (
        getattr(gen_cfg, "top_p", None) not in (None, 1.0)
        or getattr(gen_cfg, "top_k", None) not in (None, 0, 50)
        or getattr(gen_cfg, "temperature", None) not in (None, 1.0)
    )
    if has_sampling_hyperparam:
        gen_cfg.do_sample = True
🧩 Analysis chain
🏁 Scripts executed against NVIDIA/Model-Optimizer (outputs elided):
cd examples/llm_ptq && head -n 1 example_utils.py && wc -l example_utils.py
cd examples/llm_ptq && sed -n '830,880p' example_utils.py
cd examples/llm_ptq && grep -n "def get_model" example_utils.py | head -5
cd examples/llm_ptq && rg "get_model\(" hf_ptq.py -A 5 -B 2
cd examples/llm_ptq && rg "\.generate\(" hf_ptq.py -B 3 -A 3
cd examples/llm_ptq && sed -n '1,50p' hf_ptq.py | head -30
cd examples/llm_ptq && grep -n "full_model" hf_ptq.py | head -20
cd examples/llm_ptq && sed -n '300,350p' hf_ptq.py
cd examples/llm_ptq && sed -n '334,345p' hf_ptq.py
cd examples/llm_ptq && sed -n '350,365p' hf_ptq.py
cd examples/llm_ptq && grep -n "generated_ids_before_ptq" hf_ptq.py
cd examples/llm_ptq && sed -n '257,270p' hf_ptq.py
cd examples/llm_ptq && sed -n '580,600p' hf_ptq.py
cd examples/llm_ptq && sed -n '880,925p' hf_ptq.py
cd examples/llm_ptq && sed -n '1100,1125p' hf_ptq.py
cd examples/llm_ptq && sed -n '970,1020p' hf_ptq.py
cd examples/llm_ptq && sed -n '1180,1220p' hf_ptq.py
Don't mutate the live generation_config in get_model().
The mutation persists on the returned model object, and both the before-PTQ and after-PTQ preview calls (full_model.generate() at lines 922 and 980 in hf_ptq.py) use that same model instance. For checkpoints with sampling hyperparameters, this makes the previews non-deterministic instead of deterministic, undermining PTQ smoke test comparisons. Normalize a copy during export instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 853 - 867, The current code
mutates the live model.generation_config (gen_cfg) which makes the same model
instance used by get_model() non-deterministic; instead, create a copy of the
generation_config (e.g., via copy.deepcopy or by constructing a new
GenerationConfig from the dict) and modify the copy’s do_sample flag, leaving
model.generation_config unchanged; update the export/normalization logic around
gen_cfg to use this gen_cfg_copy (or a temporary variable) so
previews/full_model.generate() remain deterministic and only the exported
metadata contains the normalized setting.
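A minimal sketch of the suggested fix: swap a normalized deep copy of the generation_config in only for the duration of export, leaving the live model.generation_config untouched. The helper name matches the normalized_generation_config_for_export() referenced in the walkthrough, but the body below is a sketch, not the PR's code.
# Sketch of the suggested fix: swap in a normalized copy of generation_config around
# export so preview generate() calls stay deterministic. Body is illustrative only.
import copy
from contextlib import contextmanager


@contextmanager
def normalized_generation_config_for_export(model):
    original = getattr(model, "generation_config", None)
    if original is None:
        yield
        return
    normalized = copy.deepcopy(original)
    if not getattr(normalized, "do_sample", False):
        has_sampling_hyperparam = (
            getattr(normalized, "top_p", None) not in (None, 1.0)
            or getattr(normalized, "top_k", None) not in (None, 0, 50)
            or getattr(normalized, "temperature", None) not in (None, 1.0)
        )
        if has_sampling_hyperparam:
            normalized.do_sample = True  # metadata only; calibration and previews unaffected
    model.generation_config = normalized
    try:
        yield
    finally:
        model.generation_config = original  # restore the untouched original
Export would then run inside the context, e.g. `with normalized_generation_config_for_export(model): export_hf_checkpoint(...)`, so only the saved metadata carries the normalized flag.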
…mat ID
Integrates the scaffolding from PR #1313 (NVFP4 W4A16 generic support) into this branch so the two PRs don't diverge on naming or export-path coverage. The embedding + Nemotron-H enablement in the previous commit is unchanged; this commit just adopts #1313's conventions for the pieces that overlap.
Changes
- modelopt/torch/quantization/config.py: rename the W4A16 recipe constant NVFP4_DEFAULT_WEIGHT_ONLY_CFG -> NVFP4_W4A16_CFG to match #1313.
- modelopt/torch/export/model_config.py: add QUANTIZATION_NVFP4_W4A16 as a distinct format ID instead of relying on NVFP4 branches tolerating a disabled input_quantizer.
- modelopt/torch/export/quant_utils.py: thread NVFP4_W4A16 through get_weight_scaling_factor, get_weight_scaling_factor_2, to_quantized_weight, and the nvfp4_w4a16 branch of process_layer_quant_config. Add explicit W4A16 detection in _get_quantization_from_layer when input_quantizer is absent/disabled.
- modelopt/torch/export/unified_export_hf.py: add NVFP4_W4A16 to the weight_scale_2 registration and NVFP4 transpose lists.
- modelopt/torch/export/convert_hf_config.py: add NVFP4_W4A16 mapping in _quant_algo_to_group_config and convert_hf_quant_config_format so the llm-compressor conversion emits a weight-only config group.
- examples/llm_ptq/hf_ptq.py: rename qformat nvfp4_wo -> nvfp4_w4a16; add --exclude_modules CLI (composes with the Nemotron-H helpers added in the previous commit); emit a post-export vLLM deployment warning.
- examples/llm_ptq/scripts/huggingface_example.sh: add nvfp4_w4a16 to the qformat allowlist, EXCLUDE_MODULES env pass-through, and a W4A16 export notice.
- CHANGELOG.rst: document W4A16 (covers the overlap with #1313) and the Embedding / Nemotron-H enablement unique to this PR.
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py: add an nvfp4_w4a16 parametrize entry for the tiny_llama fixture.
pre-commit (ruff, ruff-format, mypy, bandit, insert-license, rst-lint) passes on all touched files.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Actionable comments posted: 3
♻️ Duplicate comments (1)
examples/llm_ptq/hf_ptq.py (1)
597-637: ⚠️ Potential issue | 🟠 Major
Keep lm_head quantization aligned with the base Nemotron-H recipe. This helper only re-enables *lm_head*weight_quantizer. When Nemotron-H runs with an activation-aware recipe such as fp8 or nvfp4, lm_head stops matching the rest of the model; for NVFP4, modelopt/torch/export/quant_utils.py now even reclassifies it as nvfp4_w4a16 because the input quantizer stays disabled. Either gate this helper to weight-only recipes or append a mirrored *lm_head*input_quantizer rule copied from the base config.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 597 - 637, The helper _enable_lm_head_and_embedding_quantization currently only re-enables the lm_head weight quantizer which desynchronizes lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to either (A) only run when the active recipe is weight-only (check quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored "*lm_head*input_quantizer" entry copied from the base/input quantizer config so lm_head keeps the same input quantization as the rest of the model; modify _enable_lm_head_and_embedding_quantization to perform one of these two fixes and ensure the new entry uses copy.deepcopy like the existing weight entries.
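A minimal sketch of option (B) from this comment, mirroring the base input-quantizer entry for lm_head. The dictionary-style quant_cfg and the helper/parameter names are assumptions for illustration; only the deepcopy/mirroring idea comes from the review comment.
# Sketch (hypothetical names): re-enable lm_head weight quantization and mirror the
# base input quantizer so lm_head keeps the same activation format as other layers.
import copy


def extend_lm_head(quant_cfg: dict, weight_quantizer_cfg: dict, input_quantizer_cfg: dict | None):
    cfg = quant_cfg["quant_cfg"]
    # Appending after the default "*lm_head*" disable re-enables it,
    # since the last matching wildcard entry wins.
    cfg["*lm_head*weight_quantizer"] = copy.deepcopy(weight_quantizer_cfg)
    if input_quantizer_cfg is not None:  # activation-aware recipe (fp8, nvfp4, ...)
        cfg["*lm_head*input_quantizer"] = copy.deepcopy(input_quantizer_cfg)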
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 686-697: The Nemotron-H opt-in currently adds enable/config
entries after quantize_main() has already appended user --exclude_modules rules
(so it can override user exclusions); update the flow so user exclusions are
respected by either moving the
_enable_lm_head_and_embedding_quantization(quant_cfg, weight_quantizer_cfg) call
to run before quantize_main()/before mono_quantize() applies exclude updates, or
(preferably) change _enable_lm_head_and_embedding_quantization to check
quant_cfg.exclude_modules (and any existing disable rules) and skip adding
enable/config entries for "lm_head" or "embeddings" if the user explicitly
excluded them; make this change around quantize_main(), mono_quantize(), and
_enable_lm_head_and_embedding_quantization so the user's --exclude_modules is
never silently undone.
In `@examples/llm_ptq/scripts/huggingface_example.sh`:
- Around line 130-132: The wrapper currently expands shell globs because
EXCLUDE_MODULES is injected unquoted into PTQ_ARGS before calling hf_ptq.py; fix
this by preserving the literal pattern when forwarding --exclude_modules: stop
building a single unquoted string and use a bash array for PTQ_ARGS (e.g.,
append the two separate elements "--exclude_modules" and "$EXCLUDE_MODULES") or
otherwise ensure the variable is quoted when added so hf_ptq.py receives the
exact pattern (references: EXCLUDE_MODULES, PTQ_ARGS, and the --exclude_modules
argument passed to hf_ptq.py).
In `@modelopt/torch/export/convert_hf_config.py`:
- Around line 191-198: The NVFP4_W4A16 branch sets config_group_details.targets
to only ["Linear"], which omits embeddings even though weight-only quantization
applies to Embedding layers; update the targets list in the NVFP4_W4A16 branch
(where quant_algo_value == "NVFP4_W4A16" and config_group_details is built) to
include "Embedding" (e.g., ["Linear", "Embedding"]) before assigning
new_config["config_groups"] so compressed-tensors exports match the actual
NVFP4_W4A16 quantization coverage.
---
Duplicate comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 597-637: The helper _enable_lm_head_and_embedding_quantization
currently only re-enables the lm_head weight quantizer which desynchronizes
lm_head when activation-aware recipes (e.g. fp8/nvfp4) are used; update it to
either (A) only run when the active recipe is weight-only (check
quant_cfg["algorithm"] or similar indicator) OR (B) also append a mirrored
"*lm_head*input_quantizer" entry copied from the base/input quantizer config so
lm_head keeps the same input quantization as the rest of the model; modify
_enable_lm_head_and_embedding_quantization to perform one of these two fixes and
ensure the new entry uses copy.deepcopy like the existing weight entries.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7568812b-30c3-4c61-bbc4-d04ef8fc9364
📒 Files selected for processing (9)
- CHANGELOG.rst
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/convert_hf_config.py
- modelopt/torch/export/model_config.py
- modelopt/torch/export/quant_utils.py
- modelopt/torch/export/unified_export_hf.py
- modelopt/torch/quantization/config.py
- tests/gpu/torch/export/test_unified_hf_export_and_check_safetensors.py
✅ Files skipped from review due to trivial changes (2)
- modelopt/torch/export/model_config.py
- CHANGELOG.rst
🚧 Files skipped from review as they are similar to previous changes (1)
- modelopt/torch/export/unified_export_hf.py
- example_utils: swap in a normalized generation_config via context manager during export instead of mutating the live one in get_model() — preview generate() calls stay deterministic
- hf_ptq: mirror *lm_head*input_quantizer for activation-aware recipes so lm_head doesn't silently downgrade to W4A16 under NVFP4/FP8
- hf_ptq: respect --exclude_modules in the Nemotron-H lm_head/embedding override so user exclusions aren't silently undone
- hf_ptq: add nvfp4_w4a16 to the auto_quantize qformat allowlist for consistency with QUANT_CFG_CHOICES
- huggingface_example.sh: pass --exclude_modules via a bash array (set -f) so wildcard patterns like *embed_tokens* reach argparse verbatim instead of being glob-expanded against the filesystem
- convert_hf_config: include Embedding in the NVFP4_W4A16 target set so compressed-tensors consumers dispatch on quantized embedding weights
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
🧹 Nitpick comments (1)
examples/llm_ptq/example_utils.py (1)
873-876: Consider documenting the top_k=50 check. The check getattr(original, "top_k", None) not in (None, 0, 50) treats top_k=50 as a "default/unset" value. While 50 is indeed the transformers default, this implicit knowledge could benefit from a brief inline comment for maintainability.
💡 Suggested clarification
  has_sampling_hyperparam = (
      getattr(original, "top_p", None) not in (None, 1.0)
-     or getattr(original, "top_k", None) not in (None, 0, 50)
+     or getattr(original, "top_k", None) not in (None, 0, 50)  # 50 is transformers default
      or getattr(original, "temperature", None) not in (None, 1.0)
  )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/example_utils.py` around lines 873 - 876, The condition using getattr(original, "top_k", None) not in (None, 0, 50) implicitly treats top_k==50 as a default/unset value; add a brief inline comment next to that expression (or expand the surrounding docstring) stating that 50 is the HuggingFace/transformers default so it should be treated as unset, e.g., “# 50 is transformers' default top_k, treat as unset”; ensure the comment references getattr(original, "top_k", None) so future readers understand why 50 is excluded.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@examples/llm_ptq/example_utils.py`:
- Around line 873-876: The condition using getattr(original, "top_k", None) not
in (None, 0, 50) implicitly treats top_k==50 as a default/unset value; add a
brief inline comment next to that expression (or expand the surrounding
docstring) stating that 50 is the HuggingFace/transformers default so it should
be treated as unset, e.g., “# 50 is transformers' default top_k, treat as
unset”; ensure the comment references getattr(original, "top_k", None) so future
readers understand why 50 is excluded.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 0eeb4272-c356-455a-9a92-0ae3a6fbd489
📒 Files selected for processing (4)
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- examples/llm_ptq/scripts/huggingface_example.sh
- modelopt/torch/export/convert_hf_config.py
    mts.export(full_model)


def _enable_lm_head_and_embedding_quantization(
Can we define this in a modelopt_recipe if everything in modelopt_recipes/models can be captured with our yaml recipe system?
    # For Nemotron-H (Mamba-2 + MLP + Attention hybrid, e.g. NVIDIA-Nemotron-3-Nano-4B),
    # extend quantization coverage to the lm_head and the input token embedding. On this
    # architecture those two 131072x3136 tables account for ~21% of parameters, so leaving
    # them at bf16 wastes most of the NVFP4 memory benefit.
    if model_type == "nemotron_h":
        weight_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "weight_quantizer")
        if weight_quantizer_cfg is not None:
            # ``input_quantizer_cfg`` is present only for activation-aware recipes (fp8, nvfp4,
            # ...). For weight-only recipes (nvfp4_w4a16, fp8_pb_wo, ...) this returns None and
            # ``lm_head`` stays weight-only along with the embedding.
            input_quantizer_cfg = _extract_wildcard_quantizer_cfg(quant_cfg, "input_quantizer")
            print(
                "Nemotron-H detected: extending quantization to lm_head and input embedding "
                "(backbone.embeddings)."
            )
            _enable_lm_head_and_embedding_quantization(
                quant_cfg,
                weight_quantizer_cfg,
                input_quantizer_cfg=input_quantizer_cfg,
                user_excluded_modules=args.exclude_modules or None,
            )
        else:
            warnings.warn(
                "Nemotron-H detected but quant_cfg has no wildcard '*weight_quantizer' entry; "
                "skipping lm_head/embedding extension (model-specific or non-standard recipe)."
            )
Same as the previous comment, wondering if our recipe system can replace this ad hoc change to support specific models
    if args.qformat == "nvfp4_w4a16":
        warnings.warn(
            "TensorRT-LLM and SGLang do not support this format. "
            "To serve on vLLM, convert the NVFP4 W4A16 checkpoint to compressed-tensors format."
hi @ajrasane, should we point users to how they can convert? Do we have a helper in ModelOpt we should point them to?
@hychiang-git, are you planning to merge your conversion script to modelopt?
…ML recipe
Move the Nemotron-H-specific quantization extensions out of `hf_ptq.py` and into a declarative recipe at `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`, addressing PR #1327 review feedback.
The recipe captures exactly what the removed `_enable_lm_head_and_embedding_quantization` helper did:
* All Linear weight quantizers ON (NVFP4 W4A16, group_size 16, scale_bits e4m3).
* Standard `_default_disabled_quantizer_cfg` exclusions (BatchNorm, conv1d, etc.).
* `*lm_head*weight_quantizer`, `*embeddings*weight_quantizer`, and `*embed_tokens*weight_quantizer` re-enabled AFTER the default disables so they take precedence (last matching entry wins).
Drop the helpers (`_enable_lm_head_and_embedding_quantization`, `_extract_wildcard_quantizer_cfg`) and the `if model_type == "nemotron_h":` block in `mono_quantize`. Users now opt in explicitly via `--recipe models/Nemotron-H/nvfp4_w4a16` instead of relying on auto-detection.
Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` (RTX 6000 Ada, calib_size=16, calib_seq=256): 94 weight quantizers enabled and 21 disabled (the Mamba `*mixer.conv1d*` layers), `lm_head.weight_quantizer` and `model.embeddings.weight_quantizer` carry NVFP4 cfg, exported safetensors is 2.13 GiB (matches prior PR-validation export size), and `hf_quant_config.json` reports `quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d layers]`.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
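For readers who want to see what the recipe expresses, here is a rough Python rendering of the resulting quantizer config. The field names follow ModelOpt's NVFP4 config conventions and are assumptions here; this is not the YAML file's contents.
# Rough rendering of what the Nemotron-H nvfp4_w4a16 recipe expresses; field names
# follow ModelOpt's NVFP4 config conventions and are assumptions, not the YAML itself.
NVFP4_WEIGHT_ONLY = {
    "num_bits": (2, 1),                                                # FP4 (E2M1) weights
    "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},  # group 16, E4M3 scales
    "axis": None,
    "enable": True,
}

NEMOTRON_H_NVFP4_W4A16 = {
    "quant_cfg": {
        "*weight_quantizer": NVFP4_WEIGHT_ONLY,             # all Linear weights
        "*input_quantizer": {"enable": False},              # W4A16: activations stay bf16
        "*mixer.conv1d*": {"enable": False},                # standard default disables
        "*lm_head*weight_quantizer": NVFP4_WEIGHT_ONLY,     # re-enabled after the defaults;
        "*embeddings*weight_quantizer": NVFP4_WEIGHT_ONLY,  # last matching entry wins
        "*embed_tokens*weight_quantizer": NVFP4_WEIGHT_ONLY,
    },
    "algorithm": "max",
}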
`_maybe_patch_transformers_nemotron_h_mixer_types()` papered over three gaps
in transformers 5.5.x's built-in Nemotron-H port:
1. `NemotronHConfig._pattern_to_list` not mapping `-` → mlp
2. `MIXER_TYPES` not registering `mlp`
3. `ALLOWED_LAYER_TYPES` not including `mlp`
modelopt's supported transformers floor is 4.56 (per `pyproject.toml` and
`modelopt/torch/__init__.py`). On 4.56.x the upstream Nemotron-H module
doesn't exist at all (`transformers.models.nemotron_h` is missing) — so the
patch's three branches all hit `ImportError` and silently no-op. With
`--trust_remote_code`, the Nemotron-H checkpoint's bundled
`configuration_nemotron_h.py` / `modeling_nemotron_h.py` carry their own
`MIXER_TYPES`, `_pattern_to_list`, and `validate_layers_block_type`, so the
ad-hoc transformers patches were never load-bearing for that codepath either.
Drop the helper and the `_maybe_patch_transformers_nemotron_h_mixer_types()`
call in `get_model()`. Smoke-tested on `transformers 4.56.2` with
`AutoConfig.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
trust_remote_code=True)` — config and model load cleanly without the patch.
Users running on transformers 5.5.x (modelopt's experimental band) who
attempt to load Nemotron-H without `--trust_remote_code` will hit the
upstream gap directly; the recommended path remains
`--trust_remote_code` + the model's bundled remote code.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
The `--exclude_modules` flag was added in this PR as an escape hatch for overriding the auto-applied lm_head/embedding inclusion on Nemotron-H. Now that meenchen's recipe-system review is addressed and the Nemotron-H extensions live in `modelopt_recipes/models/Nemotron-H/nvfp4_w4a16.yaml`, this flag has no remaining purpose: users who want different exclusions write a different recipe.
Removes:
* the `--exclude_modules` argparse entry in `hf_ptq.py`
* the `args.exclude_modules` apply-loop in `quantize_main()`
* the `EXCLUDE_MODULES` env-var passthrough + `EXCLUDE_MODULES_ARGS` bash array in `examples/llm_ptq/scripts/huggingface_example.sh`
Verified end-to-end on `nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16` with `--recipe models/Nemotron-H/nvfp4_w4a16` (transformers 4.56.2, GPU 5, calib_size=16): same coverage as before — 94 weight quantizers enabled, 21 disabled (the Mamba `*mixer.conv1d*` layers); `lm_head.weight_quantizer` and `backbone.embeddings.weight_quantizer` carry NVFP4 W4A16 cfg; exported safetensors 2.1 GiB; `hf_quant_config.json` reports `quant_algo=NVFP4_W4A16`, `group_size=16`, `exclude_modules=[21 conv1d layers]`.
The recipe still dictates the exclusion set, so behavior is unchanged for the supported codepath.
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Related PR
Aligns with #1313 (Support NVFP4 W4A16 quantization) — shares the NVFP4_W4A16_CFG recipe, the nvfp4_w4a16 qformat, and the QUANTIZATION_NVFP4_W4A16 format ID. This PR adds the embedding + lm_head quantization support on top.
Summary
Extends ModelOpt PTQ so the input token embedding and output LM head can participate in NVFP4 quantization, and wires that up for Nemotron-H where those two 131072×3136 tables are ~21% of parameters and leaving them in bf16 wastes most of the compression.
Changes
Core quantization library
- modelopt/torch/quantization/nn/modules/quant_embedding.py (new) — Register nn.Embedding with QuantModuleRegistry. Weight-only wrapper that inherits QuantLinearConvBase but disables the input_quantizer by default (embedding inputs are integer indices, not activations; output_quantizer is already disabled by QuantInputBase._setup).
- modelopt/torch/quantization/nn/__init__.py — Import the new quant_embedding module so registration fires at library import time.
Export
- modelopt/torch/export/unified_export_hf.py — _process_quantized_modules now also walks quantized Embedding modules (previously is_quantlinear-only), so the NVFP4 packing + scale registration path in _export_quantized_weight runs for them on export.
hf_ptq.py example
- For model_type == "nemotron_h", append cfg entries that re-enable *lm_head*weight_quantizer and target the backbone embedding (*embeddings* / *embed_tokens*), overriding the default *lm_head* disable in _default_disabled_quantizer_cfg. Guarded helpers (_enable_lm_head_and_embedding_quantization, _extract_weight_quantizer_cfg) so the override only fires when a standard *weight_quantizer entry is present.
example_utils.py — environment workarounds
These are idempotent workarounds for transformers 5.5.x's partial Nemotron-H port; they no-op on a fixed transformers (e.g. inside the TRT Docker container's newer wheel):
- NemotronHConfig._pattern_to_list: add - → mlp
- ALLOWED_LAYER_TYPES: add "mlp"
- NemotronHConfig.validate_layers_block_type: accept "mlp" (also update __class_validators__ since huggingface_hub's @strict_dataclass snapshots validators at class-creation time, so overwriting the method attribute alone isn't enough)
- MIXER_TYPES["mlp"]: adapter around NemotronHMLP that accepts the layer_idx kwarg passed by NemotronHBlock
- NemotronHBlock.__init__: alias block_type == "mlp" → "moe" so the inline block_type_to_mask lookup in NemotronHModel.forward resolves to None (dispatch is unaffected — the block's forward routes both through the same else branch that calls self.mixer(hidden_states))
- generation_config: set do_sample=True when sampling hyperparams are set, so export's save_pretrained passes transformers 5.x strict validation
Validation
nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16:python examples/llm_ptq/hf_ptq.py \ --pyt_ckpt_path nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 \ --qformat nvfp4_w4a16 --kv_cache_qformat none \ --trust_remote_code --dataset cnn_dailymail \ --calib_size 16 --calib_seq 256 --batch_size 1 --skip_generate \ --export_path /tmp/nemotron_3_nano_4b_nvfp4_w4a16Produces a 2.2 GB unified HF checkpoint (vs 7.5 GB bf16), with
model.embeddings.weightandlm_head.weightboth stored as packed NVFP4 uint8 + FP8 per-block scales + FP32 global scale.hf_quant_config.jsonreportsquant_algo: NVFP4_W4A16,group_size: 16, andexclude_modulescontains only the 21 Mambaconv1dlayers (the default_default_disabled_quantizer_cfgentry for*mixer.conv1d*).pre-commit run --files <staged>passes (ruff, ruff-format, mypy, bandit, insert-license, rst-lint).Follow-ups (separate PRs)
*.weight → *.weight_packed,*.weight_scale_2 → *.weight_global_scale(inverted), and rewritesconfig.jsonquantization_configtoformat: nvfp4-pack-quantized/quant_method: compressed-tensors. Already prototyped out-of-tree; just needs cleanup + tests.vllm.LLMwith chat-template rendering,max_model_lencap,--enforce-eagerdefault for Mamba/SSM). Already prototyped out-of-tree.config.jsonpost-export cleanup: transformers 5.x stripshybrid_override_patternin favor of the derivedlayers_block_typelist, which breaks reload via the checkpoint's remoteconfiguration_nemotron_h.py(itslayers_block_typeis a read-only@property). The export path should restorehybrid_override_patternand setnum_hidden_layersexplicitly formodel_type == "nemotron_h".--vllm-compathf_ptqflag that additionally excludes Mambain_proj(output dim 17504 =intermediate + conv_dim + num_headsisn't divisible by 64, violating vLLM's Marlin repack alignment) and leaveslm_head/model.embeddingsin bf16 (vLLM'sParallelLMHead/VocabParallelEmbeddingdon't consume compressed-tensors scales), so the export is consumable by vLLM out of the box.example_utils.pymonkey-patches can be dropped.Test plan
Test plan
- nn.Embedding registers and is replaced with QuantEmbedding under mtq.quantize(..., NVFP4_W4A16_CFG, forward_loop=None) on a toy Sequential(Embedding, Linear) model; verified forward pass on CUDA.
- End-to-end PTQ + export on nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16 (see Validation above).
- A test under tests/gpu/torch/export/ for nn.Embedding weight packing is a follow-up once the conversion/load path lands.
- hf_ptq.py, and this PR doesn't change that.
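A rough re-creation of the toy smoke test described in the first bullet above (not code from this PR); it assumes NVFP4_W4A16_CFG is exposed at the modelopt.torch.quantization package level like the other recipe constants.
# Rough sketch of the toy smoke test (not the PR's code); assumes NVFP4_W4A16_CFG is
# exported from modelopt.torch.quantization like the other recipe constants.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).cuda()

# Weight-only recipe, so no calibration forward_loop is needed.
model = mtq.quantize(model, mtq.NVFP4_W4A16_CFG, forward_loop=None)
print(type(model[0]).__name__)  # expect the registered QuantEmbedding wrapper, not nn.Embedding

tokens = torch.randint(0, 1000, (2, 8), device="cuda")
print(model(tokens).shape)  # torch.Size([2, 8, 1000])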
Summary by CodeRabbit
New Features
Integration
Tests