Add Gemma4 MoE quantization support #1219
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit review settings.
📝 Walkthrough
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)
1092-1097: ⚠️ Potential issue | 🟠 Major
Pass the configured auto-quantize options through this call. This hunk only forwards `full_model`, so `--auto_quantize_method`, `--auto_quantize_score_size`, and `--auto_quantize_checkpoint` are still ignored here. The helper falls back to its defaults instead of the CLI values.
Possible fix:
```diff
 auto_quantize(
     args,
     language_model,
     calib_dataloader,
+    auto_quantize_method=args.auto_quantize_method,
+    auto_quantize_score_size=args.auto_quantize_score_size,
+    auto_quantize_checkpoint=args.auto_quantize_checkpoint,
     full_model=full_model,
 )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 1092 - 1097, The call to auto_quantize currently only forwards args, language_model, calib_dataloader and full_model, so CLI options for auto-quantization are ignored; update the call to pass the configured auto-quantize options from args (e.g. args.auto_quantize_method, args.auto_quantize_score_size, args.auto_quantize_checkpoint — match the actual arg names) into auto_quantize so the helper receives the CLI values rather than falling back to defaults (refer to auto_quantize, args, language_model, calib_dataloader, full_model in the diff).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 352-379: The gradient/kl_div base-model path passes the raw batch
(which includes "labels") into the extracted base model, causing a TypeError
because base models like Gemma4TextModel don't accept labels; add a small helper
(e.g., sanitize_batch or strip_non_inputs) that removes "labels" and any
non-forward kwargs from the batch, then call that helper inside both
forward_step implementations referenced in the is_base_model branch (where
full_model, lm_head, loss_func, forward_step and auto_quantize_method are
defined) so the model receives only valid forward inputs while loss_func still
reads labels from the original batch.
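For concreteness, a minimal sketch of the suggested helper, assuming the HF-style dict batches this script uses; the name `sanitize_batch` and the signature inspection are illustrative, not the actual fix:

```python
import inspect

import torch.nn as nn


def sanitize_batch(batch: dict, model: nn.Module) -> dict:
    """Drop "labels" and any other keys that model.forward() does not accept.

    Hypothetical helper per the review suggestion above; the real change in
    hf_ptq.py may differ.
    """
    accepted = set(inspect.signature(model.forward).parameters)
    return {k: v for k, v in batch.items() if k != "labels" and k in accepted}
```

Inside both `forward_step` implementations the call would become `model(**sanitize_batch(batch, model))`, while `loss_func` keeps reading `batch["labels"]` from the original dict.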
---
Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1092-1097: The call to auto_quantize currently only forwards args,
language_model, calib_dataloader and full_model, so CLI options for
auto-quantization are ignored; update the call to pass the configured
auto-quantize options from args (e.g. args.auto_quantize_method,
args.auto_quantize_score_size, args.auto_quantize_checkpoint — match the actual
arg names) into auto_quantize so the helper receives the CLI values rather than
falling back to defaults (refer to auto_quantize, args, language_model,
calib_dataloader, full_model in the diff).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c05034eb-0929-498f-be1d-c052746d2eba
📒 Files selected for processing (4)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/layer_utils.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/plugins/huggingface.py
PTQ can silently skip MoE expert quantization when config patterns (*mlp*, *block_sparse_moe*) don't match the model's naming convention (e.g., Gemma4 uses layers.N.experts.* instead of mlp.experts.*). This causes deployment failures downstream when vLLM/SGLang tries to load unquantized experts as quantized.

Add Step 5 validation to detect this:
- Compare exported weight names against scale params and exclude list
- Flag weights with no scales that aren't in exclude_modules
- Reference the deployment "quant/unquant layer confusion" pattern

Also add MoE expert verification to unsupported-models.md debugging tips.

Learned from: Gemma4-26B-A4B NVFP4 PTQ succeeded but experts were BF16, causing vLLM FusedMoE shape mismatch at deployment time. Fix tracked in PR #1219.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
For VLMs like Gemma4 where the extracted language_model lacks lm_head, use the full_model's lm_head to compute logits/loss from hidden states.

How to run:
cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
  --pyt_ckpt_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it \
  --qformat nvfp4,fp8 \
  --auto_quantize_bits 6.0 \
  --calib_size 512 \
  --dataset cnn_dailymail \
  --export_path /lustre/fsw/portfolios/coreai/users/yueshen/models/gemma-4-31B-it-autoquant-6.0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
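A minimal sketch of that wiring, assuming HF conventions (`output.last_hidden_state`, `full_model.lm_head`, -100 label padding); the actual code in hf_ptq.py may be organized differently:

```python
import torch.nn.functional as F


def make_loss_func(full_model):
    """Score a base text model that lacks its own lm_head by borrowing the
    full VLM's head to turn hidden states into logits. Sketch only."""

    def loss_func(output, batch):
        logits = full_model.lm_head(output.last_hidden_state)
        # causal shift: position t predicts token t + 1
        return F.cross_entropy(
            logits[:, :-1, :].reshape(-1, logits.size(-1)),
            batch["labels"][:, 1:].reshape(-1),
            ignore_index=-100,
        )

    return loss_func
```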
Add assert for full_model to satisfy mypy union-attr check, and add blank lines before nested def statements per ruff formatting rules.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Gemma4 MoE models (e.g. google/gemma-4-26B-A4B-it) store expert weights as fused 3D nn.Parameter tensors instead of nn.ModuleList of nn.Linear, causing the quantizer to silently skip expert weights.

- Register Gemma4TextExperts with the _QuantQwen35MoeExperts plugin to unfuse 3D tensors into per-expert nn.Linear layers for quantization
- Add structural is_moe() detection for modules with router + experts attributes (Gemma4 has no dedicated SparseMoeBlock class)
- Add Gemma4TextDecoderLayer to get_expert_linear_names()
- Add "*.experts.*" pattern to NVFP4_MLP_ONLY_CFG and NVFP4_EXPERTS_ONLY_CFG to match Gemma4's expert path (experts are at model.layers.X.experts, not under mlp)

Signed-off-by: Yue Shen <yueshen@nvidia.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
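To illustrate what "unfusing" means here, a self-contained sketch; the tensor layout is an assumption, and the actual plugin (_QuantQwen35MoeExperts) additionally splits gate_up_proj and keeps weights tied to the original storage:

```python
import torch
import torch.nn as nn


def unfuse_experts(fused: torch.Tensor) -> nn.ModuleList:
    """Split a fused [num_experts, out_features, in_features] parameter into
    per-expert nn.Linear modules so a Linear-only quantizer can discover them.
    Illustrative only; assumes this layout, which may differ per projection.
    """
    num_experts, out_features, in_features = fused.shape
    experts = nn.ModuleList()
    for i in range(num_experts):
        linear = nn.Linear(in_features, out_features, bias=False)
        # detach() yields a leaf view that still shares storage with `fused`
        linear.weight = nn.Parameter(fused[i].detach())
        experts.append(linear)
    return experts
```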
Example usage for Gemma4 MoE quantization:
cd /opt/Model-Optimizer/examples/llm_ptq && python hf_ptq.py \
--pyt_ckpt_path /models/gemma-4-26B-A4B-it \
--qformat nvfp4_mlp_only \
--calib_size 512 \
--dataset cnn_dailymail \
--export_path /models/gemma-4-26B-A4B-it-nvfp4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
c79ebc0 to 509d256
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
examples/llm_ptq/hf_ptq.py (1)
1069-1076: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win
Assign the result of `auto_quantize()` back to `language_model`. `auto_quantize()` returns the updated module, but this call site drops it. If the search path replaces the module instance, the rest of `quantize_main()` will keep exporting the stale reference, and extracted-VLM flows will leave `full_model` wired to the pre-quantized submodule. Please mirror the `mono_quantize()` pattern here and rebind/reattach the returned module.
Suggested fix:
```diff
-auto_quantize(
+language_model = auto_quantize(
     args,
     language_model,
     calib_dataloader,
     auto_quantize_method=args.auto_quantize_method,
     auto_quantize_score_size=args.auto_quantize_score_size,
     auto_quantize_checkpoint=args.auto_quantize_checkpoint,
     full_model=full_model,
 )
+language_model_lineage = get_language_model_from_vl(full_model)
+if language_model_lineage is not None:
+    language_model_lineage[-2].language_model = language_model
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/llm_ptq/hf_ptq.py` around lines 1069 - 1076, The call to auto_quantize(...) in quantize_main drops its return value so the possibly-replaced module isn't reattached; rebind language_model to the returned module (like mono_quantize does) so subsequent code and full_model reference the updated submodule—i.e., assign language_model = auto_quantize(...) and ensure any places that expect the updated module (e.g., full_model composition/export) use this new reference.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/layer_utils.py`:
- Around line 318-325: The new structural branch in is_moe() causes
Gemma4TextDecoderLayer to be treated as MoE but get_experts_list() still rejects
Gemma4 types, so either extend get_experts_list() to return experts for Gemma4
layers (e.g., recognize Gemma4TextDecoderLayer and extract module.experts/router
similarly) or restrict is_moe() so it only returns True for structural MoE when
the module type is supported by get_experts_list(); update the logic in is_moe()
and/or get_experts_list() (referencing is_moe() and get_experts_list()) so the
two functions remain consistent and Gemma4 export no longer trips the MoE branch
without a compatible expert-list implementation.
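A hedged sketch of the second option (extending `get_experts_list()`); the function shape and attribute names are inferred from this thread, not copied from layer_utils.py:

```python
import torch.nn as nn


def get_experts_list(module: nn.Module, model_type: str) -> list:
    """Return per-expert linear layers for a detected MoE module (sketch)."""
    if "gemma4" in model_type.lower():
        # After the plugin unfuses the 3D tensors, each expert is assumed to
        # expose gate_proj / up_proj / down_proj as nn.Linear modules.
        return [
            [expert.gate_proj, expert.up_proj, expert.down_proj]
            for expert in module.experts
        ]
    # ... existing branches (Mixtral, Qwen, ...) elided ...
    raise NotImplementedError(f"Unsupported MoE module type: {model_type}")
```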
---
Outside diff comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1069-1076: The call to auto_quantize(...) in quantize_main drops
its return value so the possibly-replaced module isn't reattached; rebind
language_model to the returned module (like mono_quantize does) so subsequent
code and full_model reference the updated submodule—i.e., assign language_model
= auto_quantize(...) and ensure any places that expect the updated module (e.g.,
full_model composition/export) use this new reference.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d8897c78-c1a7-4df0-9a38-5ae33519815f
📒 Files selected for processing (2)
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/export/layer_utils.py
- Strip `labels` from batch before passing to base text models that don't accept it (e.g. Gemma4TextModel) in the auto_quantize forward_step
- Pass CLI auto-quantize options (method, score_size, checkpoint) through to auto_quantize() instead of falling back to defaults
- Drop explicit Gemma4TextExperts/Qwen3_5MoeExperts registration; handled by register_fused_experts_on_the_fly auto-detection
- Clarify base-model detection comment as VLM-generic, not Gemma4-specific

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
509d256 to 53de662
Extend get_experts_list() to handle Gemma4 models so that the export path doesn't hit NotImplementedError when is_moe() structurally detects Gemma4TextDecoderLayer as MoE.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
Small, focused PR (85 lines) adding Gemma4 MoE quantization support. All critical previous review comments have been addressed (labels stripping, args forwarding, is_moe/get_experts_list consistency, generic naming).
Remaining concerns:
- No unit tests — The PR includes only integration-level test evidence (manual runs). No automated tests for the new `is_moe` structural detection, the new `get_expert_linear_names` branch, or the base-model auto_quantize path. This is a recurring pattern in this repo but worth flagging.
- Broad structural `is_moe()` detection — The new structural check (`hasattr(module, "router") and hasattr(module, "experts") and isinstance(module.experts, nn.Module)`) is quite broad: any future module with `router` + `experts` attributes would be classified as MoE. This works for Gemma4 today, but consider adding a comment noting this is intentionally broad, or tightening it (e.g., checking whether the module type name contains "decoder" or "layer"); a sketch of such a check follows this list.
- `auto_quantize` return value not captured (minor) — The CodeRabbit concern about not capturing `auto_quantize()`'s return in `quantize_main` was not addressed, but this follows the same in-place-modification pattern as `mono_quantize`, so it's likely fine in practice.
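A sketch of what that tightened check could look like (the shipped `is_moe()` keeps the broader version; this variant is only the reviewer's suggestion made concrete):

```python
import torch.nn as nn


def is_structural_moe(module: nn.Module) -> bool:
    """Tightened variant of the structural check, per the suggestion above."""
    return (
        hasattr(module, "router")
        and hasattr(module, "experts")
        and isinstance(module.experts, nn.Module)
        # narrow to decoder-layer-like containers to avoid false positives
        and any(s in type(module).__name__.lower() for s in ("decoder", "layer"))
    )
```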
Overall the changes look correct and well-scoped. Nudging for human review primarily due to missing unit tests.
Sync the YAML recipe with NVFP4_MLP_ONLY_CFG, which was updated to include *.experts.* for Gemma4 MoE support.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
Thanks for the thorough review!

1. Unit tests — Agreed this deserves coverage.
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1219      +/-   ##
==========================================
+ Coverage   66.35%   76.10%    +9.75%
==========================================
  Files         471      471
  Lines       50500    51602     +1102
==========================================
+ Hits        33508    39274     +5766
+ Misses      16992    12328     -4664
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
Unify forward_step definitions across is_base_model branches to reduce duplication. loss_func remains split as the logic differs fundamentally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
dbd31df to ea21ea4
18333f1 to 5238d63
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
All critical previous review comments have been addressed. Unit tests are now included covering is_moe and get_expert_linear_names. The code correctly extends existing MoE quantization infrastructure for Gemma4. One minor issue: the new test file has copyright year 2024, should be 2025 (or 2026 per the current LICENSE_HEADER).
…names

Test is_moe with name-based detection (sparsemoeblock, moelayer, arcticmoe), structural detection (router + experts), and negative cases. Test get_expert_linear_names for Gemma4, Mixtral, and NemotronH module types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: James Shen <yueshen@nvidia.com>
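A rough sketch of the shape of such tests (class and test names here are hypothetical, and `is_moe` is assumed to take a single module, as the review prompts imply):

```python
import torch.nn as nn

from modelopt.torch.export.layer_utils import is_moe


class RouterExpertsBlock(nn.Module):
    """Structurally MoE (router + experts) without a MoE-ish class name."""

    def __init__(self):
        super().__init__()
        self.router = nn.Linear(16, 4)
        self.experts = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])


def test_is_moe_structural_detection():
    assert is_moe(RouterExpertsBlock())


def test_is_moe_negative():
    assert not is_moe(nn.Linear(16, 16))
```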
5238d63 to 01fc243
```yaml
metadata:
  recipe_type: ptq
  description: NVFP4 static weight and dynamic activation for all linear layers (W4A4), FP8 KV cache, max calibration.
```
nit: I think we want to update this description
Do you have suggestions on what description should be adopted?
Summary
- Register `Gemma4TextExperts` with the `_QuantQwen35MoeExperts` plugin to unfuse fused 3D expert tensors into per-expert `nn.Linear` layers for quantization
- Add structural `is_moe()` detection for modules with `router` + `experts` attributes (Gemma4 has no dedicated `SparseMoeBlock` class — the decoder layer directly owns `router` and `experts`)
- Add `Gemma4TextDecoderLayer` to `get_expert_linear_names()`, returning `["gate_proj", "down_proj", "up_proj"]`
- Add the `"*.experts.*"` pattern to `NVFP4_MLP_ONLY_CFG` and `NVFP4_EXPERTS_ONLY_CFG` to match Gemma4's expert path (`model.layers.X.experts.*`, not nested under `mlp`)

Context: Gemma4 MoE models (e.g. `google/gemma-4-26B-A4B-it`) store expert weights as fused 3D `nn.Parameter` tensors (`gate_up_proj`, `down_proj`) instead of an `nn.ModuleList` of `nn.Linear`. Since ModelOpt's quantizer only discovers `nn.Linear` modules, it silently skips the expert weights — the bulk of the model remains unquantized.

Companion vLLM PR: vllm-project/vllm#39406 (robust quantized MoE weight loading for Gemma4)
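To see why the pattern addition is needed, a quick check with fnmatch-style matching (module names are illustrative, and modelopt's wildcard matching is treated as fnmatch-like here for the sake of the example):

```python
from fnmatch import fnmatch

# Gemma4 experts hang directly off the decoder layer, not under mlp or
# block_sparse_moe, so the pre-existing patterns never match them.
gemma4_name = "model.layers.3.experts.gate_proj.weight_quantizer"

assert not fnmatch(gemma4_name, "*mlp*")
assert not fnmatch(gemma4_name, "*block_sparse_moe*")
assert fnmatch(gemma4_name, "*.experts.*")  # pattern added by this PR
```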
Test plan
- `hf_ptq.py --pyt_ckpt_path google/gemma-4-26B-A4B-it --qformat nvfp4_mlp_only` — 35k+ quantizers inserted, 17 GB output (vs 49 GB BF16)
- `vllm serve <path> --quantization modelopt` — loads and serves successfully

🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes