Add layerwise calibration for large models #1251
Conversation
📝 Walkthrough

Renames the calibration mode flag and adds layerwise (sequential) calibration with checkpoint save/resume.
Sequence diagram:

```mermaid
sequenceDiagram
    participant Entrypoint as Calibration Entrypoint
    participant Model as Model
    participant Collector as LayerActivationCollector
    participant Forward as ForwardLoop
    participant Checkpoint as CheckpointStore
    participant GPTQ as GPTQ Updater
    Entrypoint->>Collector: attach/discover layers
    Entrypoint->>Checkpoint: detect_resume_point(checkpoint_dir)
    alt resume available
        Checkpoint-->>Collector: restore output_meta + next_inputs
    end
    loop for layer in start_layer..N
        Entrypoint->>Collector: set mode -> capture(layer)
        Entrypoint->>Forward: run forward (captures inputs / EarlyStop)
        Collector-->>Entrypoint: captured inputs
        Entrypoint->>Collector: set mode -> run(layer)
        Entrypoint->>Forward: replay captured inputs -> outputs
        Entrypoint->>Checkpoint: save(layer_weights, quantizer_state, output_meta, next_inputs)
        alt GPTQ enabled
            Entrypoint->>GPTQ: update_weights_for_layer(...)
        end
    end
    Entrypoint->>Checkpoint: full_restore(all_layers)
    Entrypoint->>Collector: unpatch and cleanup
```
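The `detect_resume_point` step in the sequence diagram above can be sketched roughly as follows. This is a hypothetical sketch: the `layer_{k}` directory naming and the use of `next_inputs.pt` as the completion marker are assumptions based on the checkpoint artifacts named elsewhere in this thread, not the repository's actual layout.

```python
from pathlib import Path


def detect_resume_point(checkpoint_dir: str) -> int:
    """Return the first layer index without a completed checkpoint.

    Assumes (hypothetically) that each completed layer k leaves
    checkpoint_dir/layer_{k}/next_inputs.pt on disk; calibration
    resumes from the first gap.
    """
    layer = 0
    while (Path(checkpoint_dir) / f"layer_{layer}" / "next_inputs.pt").exists():
        layer += 1
    return layer
```

With this layout, a fresh directory resumes from layer 0, and a directory holding markers for layers 0..K-1 resumes from layer K.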
Codecov Report

❌ Patch coverage report — additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1251      +/-   ##
==========================================
+ Coverage   72.52%   76.67%   +4.15%
==========================================
  Files         459      459
  Lines       48664    48975     +311
==========================================
+ Hits        35292    37552    +2260
+ Misses      13372    11423    -1949
```

Flags with carried-forward coverage won't be shown. ☔ Full report in Codecov by Sentry.
cjluo-nv left a comment

This is a substantial PR (~1500 lines) that adds checkpoint save/resume for sequential calibration, extends support to FSDP2 and accelerate-offloaded models, and renames activation_collector.py → layerwise_calib.py. The changes are cohesive and well-tested (unit + GPU tests for checkpoint, resume, offload, and FSDP2 scenarios).

Key issues found:

- Removed guard on sequential calibration methods — the assertion restricting sequential calibration to `max` and `gptq` was removed without replacement. Methods like `awq`, `smoothquant`, and `svdquant` operate on the full model (not per-layer) and will break silently or produce incorrect results when used with `use_sequential=True`.
- `weights_only=False` security concern — `torch.load(..., weights_only=False)` is used for loading checkpoints, which can execute arbitrary code. While the checkpoints are locally generated, this is flagged by security scanners and should use `weights_only=True` where possible.

Minor observations:

- PR size is above ~1000 lines, but the changes are cohesive and hard to split.
- Good test coverage for the new functionality.
- The `temporarily_remove_accelerate_hook` rewrite is a nice improvement that avoids the `init_hook` pitfall.
- `_writeback_params_to_weights_map` properly handles all parameters (not just `weight`).
- The FSDP2 context manager is correctly generalized to handle all DTensor parameters.
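The removed guard the reviewer asks about could look roughly like this. A hypothetical sketch only — the function name `check_layerwise_supported` and the set constant are illustrative, not the repository's actual code:

```python
# Illustrative sketch of the kind of guard being requested back; names are hypothetical.
SUPPORTED_LAYERWISE_ALGOS = {"max", "gptq"}  # algorithms that operate per-layer


def check_layerwise_supported(algorithm: str, use_sequential: bool) -> None:
    """Reject full-model algorithms when sequential (layerwise) calibration is on."""
    if use_sequential and algorithm not in SUPPORTED_LAYERWISE_ALGOS:
        raise ValueError(
            f"Sequential calibration supports only {sorted(SUPPORTED_LAYERWISE_ALGOS)}; "
            f"got {algorithm!r}. awq/smoothquant/svdquant need the full model."
        )
```

Failing loudly here is the point of the reviewer's comment: without the guard, an unsupported algorithm would run per-layer and silently produce wrong statistics.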
Actionable comments posted: 2

Caution: some comments are outside the diff and can't be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

modelopt/torch/quantization/model_calib.py (1)

Lines 1566-1632: ⚠️ Potential issue | 🔴 Critical — add inline comments to the `torch.load(..., weights_only=False)` calls in `layerwise_calib.py`. Per SECURITY.md and the coding guidelines, `torch.load(..., weights_only=False)` must include an inline comment documenting why the file is internally generated, trusted, and safe to deserialize. Lines 545 and 555 in `modelopt/torch/quantization/utils/layerwise_calib.py` need this justification:

- Line 545: loading `output_meta.pt`
- Line 555: loading `next_inputs.pt`

Add a comment before each call explaining that these checkpoint files are generated and managed internally by the sequential calibration process, confirming they are trusted sources.
🧹 Nitpick comments (1)

modelopt/torch/quantization/utils/layerwise_calib.py (1)

Lines 591-630: LGTM! The `save` method correctly:

- uses the `enable_weight_access_and_writeback` context for managed-weight frameworks,
- moves all data to CPU before storage,
- has a defensive fallback for missing `output_meta` (lines 617-618).

The fallback creates dummy metadata if `output_meta` is None, which could mask state-machine bugs. Consider logging a warning in this case.

Optional: add a warning for missing `output_meta`:

```diff
 output_meta = getattr(layer._seq_calib, "output_meta", None)
 if output_meta is None:
+    print_rank_0(
+        f"Warning: layer {layer_idx} has no output_meta; using fallback. "
+        "This may indicate the layer was not run in 'run' mode."
+    )
     output_meta = LayerActivationCollector._extract_output_meta(torch.zeros(1))
```
📒 Files selected for processing (19):

- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/plugins/accelerate.py
- modelopt/torch/quantization/plugins/huggingface.py
- modelopt/torch/quantization/utils/__init__.py
- modelopt/torch/quantization/utils/activation_collector.py
- modelopt/torch/quantization/utils/calib_utils.py
- modelopt/torch/quantization/utils/core_utils.py
- modelopt/torch/quantization/utils/layerwise_calib.py
- modelopt/torch/utils/network.py
- tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py
- tests/gpu/torch/quantization/test_fsdp2.py
- tests/gpu/torch/quantization/test_sequential_calibrate.py
- tests/unit/torch/quantization/plugins/test_huggingface.py
- tests/unit/torch/quantization/test_calib.py
- tests/unit/torch/quantization/test_sequential_calibrate.py
- tests/unit/torch/quantization/test_sequential_checkpoint.py
- tests/unit/torch/quantization/test_utils.py

💤 Files with no reviewable changes (1)

- modelopt/torch/quantization/utils/activation_collector.py
🧹 Nitpick comments (1)

tests/unit/torch/quantization/test_sequential_calibrate.py (1)

Lines 585-590: Optional: guarantee cleanup with `try/finally` in the restore test. Use the same cleanup pattern as the other tests so unpatch always runs if collection fails mid-test.

♻️ Suggested change:

```diff
 collector = LayerActivationCollector(model)
 collector._patch_all_layers()
-for layer in originals:
-    collector.get_input_activations(layer, forward_loop)
-collector._unpatch_all_layers()
+try:
+    for layer in originals:
+        collector.get_input_activations(layer, forward_loop)
+finally:
+    collector._unpatch_all_layers()
```
📒 Files selected for processing (2):

- tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py
- tests/unit/torch/quantization/test_sequential_calibrate.py

🚧 Files skipped from review as similar to previous changes (1)

- tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py
@shengliangxu I am overwriting the layerwise save from the yaml recipe this way. Does this look good to you?

@meenchen @sugunav14 — added.
Edwardf0t1 left a comment

LGTM in general. @realAsma it would be great to test GLM5.1 as well. cc @Fridah-nv

There's a draft PR for GLM5/5.1: #985
Introduces layerwise calibration to enable PTQ on models that do not fit in GPU memory, plus supporting infrastructure:

- New modelopt/torch/quantization/utils/layerwise_calib.py with layer-by-layer calibration and per-mode opt-out
- Disk offloading support in enable_weight_access_and_writeback
- Memory-efficient inplace fakequant export with disk offload
- Meta device detection in layerwise restore
- Fix meta tensor crash when exporting offloaded vLLM fakequant checkpoints
- Fix json.dumps sort_keys error with mixed int/str keys in quant_cfg
- Rename test_sequential_calibrate -> test_layerwise_calibrate (unit + gpu)
- Remove obsolete activation_collector.py

Signed-off-by: realAsma <akuriparambi@nvidia.com>
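The json.dumps sort_keys error mentioned in the commit above can be reproduced in isolation. The `stringify_keys` helper is an illustrative sketch of one possible workaround, not necessarily the fix the commit uses:

```python
import json

quant_cfg = {"layers": {0: "nvfp4", "default": "fp8"}}  # mixed int/str keys

# json.dumps(quant_cfg) alone works (keys are coerced to strings), but
# sort_keys=True sorts BEFORE coercion, comparing the int key 0 against the
# str key "default" and raising TypeError.


def stringify_keys(obj):
    """Recursively convert dict keys to str so sort_keys=True can compare them."""
    if isinstance(obj, dict):
        return {str(k): stringify_keys(v) for k, v in obj.items()}
    return obj


print(json.dumps(stringify_keys(quant_cfg), sort_keys=True))
# → {"layers": {"0": "nvfp4", "default": "fp8"}}
```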
Max calibration is fast enough that checkpointing each layer adds unnecessary I/O and disk usage. Comment explains why it is omitted. Signed-off-by: realAsma <akuriparambi@nvidia.com>
Adds test_hf_vllm_export_offload covering the inplace_mem_efficient=True path of export_hf_vllm_fq_checkpoint on a CPU-offloaded tiny LLaMA. The test asserts the inplace path actually mutates offloaded layer weights (falsifying a silent fall-through to the copy path), that the reloaded HF model matches a deepcopy+fold_weight reference built inside enable_weight_access_and_writeback (materializes meta tensors before folding), and that the saved quantizer state preserves input amaxes. Also adds a CHANGELOG.rst bullet under 0.44 New Features describing the layerwise calibration feature and linking to the experts-only recipe. Signed-off-by: realAsma <akuriparambi@nvidia.com>
Show the two recipes separately: first the plain layerwise recipe for the base feature, then the intermediate-progress-saving detail with the GPTQ recipe that demonstrates it. Signed-off-by: realAsma <akuriparambi@nvidia.com>
Monkey-patch save_pretrained to a no-op so the test exercises only the PR's new inplace_mem_efficient=True contribution (per-layer enable_weight_access_and_writeback dispatch + inplace fake-quant writeback) without tripping transformers load_offloaded_parameter on SequentialHook — a pre-existing upstream limitation unrelated to this PR's new code. Broaden the folded-weights assertion to cover all decoder layers (not just the offloaded layer 0) so regressions in the on-GPU inplace path are also caught. The vllm_fq_modelopt_state.pth contents are still asserted since torch.save happens before save_pretrained. Signed-off-by: realAsma <akuriparambi@nvidia.com>
```python
modelopt_state = mto.modelopt_state(model)
# ``modelopt_state`` may be stale if another mode (e.g. calibrate) ran last. Rebuild
# ``quantizer_state`` and drop disabled weight quantizer entries (weights already folded).
qstate = quantizer_state(model)
for key in list(qstate):
    if key.endswith("weight_quantizer") and qstate[key].get("_disabled"):
        qstate.pop(key)

for mode_str, m_state in modelopt_state.get("modelopt_state_dict", []):
    if mode_str == "quantize" and "metadata" in m_state:
        m_state["metadata"]["quantizer_state"] = qstate
        break
```

@kinjalpatel27 why are we doing this specially for quantize mode?

We need to remove the disabled weight_quantizer entries from metadata; otherwise the reload creates an issue.
… validator Per reviewer feedback on #1251, re-introduce use_sequential as a real field marked deprecated and frozen. Pydantic emits DeprecationWarning on use and blocks post-construction reassignment; a model_validator(mode='after') copies the legacy value into layerwise when layerwise was not set. Replaces the mode='before' key-rename validator added in b4c6a03. Signed-off-by: realAsma <akuriparambi@nvidia.com>
…guard - config: accept legacy `use_sequential` via AliasChoices on `layerwise` so pre-#1251 PTQ checkpoints load; still serializes as `layerwise` - recipes: split nvfp4_experts_only-fp8_kv into default (no layerwise) and _layerwise variants - hf_ptq: auto batch-size detection not supported with layerwise; default to batch_size=1 in that case - tests: cover alias accept, current-name accept, dump under current name, and extra='forbid' still rejecting unknowns Signed-off-by: realAsma <akuriparambi@nvidia.com>
## Summary
Adds **performant layerwise calibration** for quantizing large models
(e.g. DeepSeek-R1 671B) that don't fit entirely on GPU. ([Example
commands](#example-commands))
1. **Performant calibration for large models** — Each decoder layer is
moved from CPU/disk to GPU (accelerate) or unsharded (FSDP2) **only
once** and kept on GPU for the entire calibration step. Previously,
every calibration batch triggered weight transfer for every layer —
O(num_batches) weight movements per layer. Now it is O(1) per layer.
This also means you can **increase batch size** since only one layer's
weights occupy GPU at a time — e.g. DeepSeek-R1 on a single node
(8×80GB) with `batch_size=16` and `gpu_max_mem_percentage=0.5`.
2. **Checkpoint save/resume** — Saves progress after each layer, so jobs
that exceed cluster time limits (e.g. 4-hour Slurm windows for 100+
layer MoE models) can resume from the last completed layer.
3. **Rename** `sequential_calibrate` → `layerwise_calibrate` for
clarity.
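The weight-movement claim in item 1 — O(num_batches) transfers per layer before, O(1) per layer now — can be made concrete with toy arithmetic (illustrative only; the counting model is an assumption, not instrumentation from the PR):

```python
def weight_movements(num_layers: int, num_batches: int, persistent: bool) -> int:
    """Toy count of host<->GPU weight transfers for one calibration pass.

    Without persistent materialization, every calibration batch re-materializes
    every layer's weights; with it, each layer is moved once and stays resident
    while it is the active layer.
    """
    return num_layers if persistent else num_layers * num_batches
```

For a 36-layer model calibrated with 4 batches, this toy model gives 144 transfers without persistence versus 36 with it.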
### Design details
The existing layerwise state machine (skip/run/capture) already
processes one layer at a time, but skip-mode layers still kept their
parameters in the ModuleList — so frameworks transferred all weights
every forward pass. This PR adds:
- **`_SkipLayer`**: replaces fully-calibrated layers with a
parameter-free dummy in the ModuleList, so framework hooks have nothing
to transfer
- **`persistent_materialization`**: keeps the active layer on GPU for
the entire calibration step, avoiding repeated offload/reload cycles
Checkpoint save is per-layer; restore is bulk — quantizer state and
weights for layers 0..K-1 are restored once at the end of calibration,
keeping the hot path fast.
### Example commands
**Qwen3-8B** (NVFP4+GPTQ, single GPU):
```bash
python hf_ptq.py \
--pyt_ckpt_path Qwen/Qwen3-8B \
--recipe nvfp4_gptq_sequential.yaml \
--calib_size 64 \
--batch_size 16 \
--dataset cnn_dailymail \
--export_path outputs/qwen3_8b_nvfp4_gptq_seq \
--gpu_max_mem_percentage 0.5 \
--use_seq_device_map \
--vllm_fakequant_export
```
**DeepSeek-R1** (NVFP4 experts-only + FP8 KV, 8×80GB):
```bash
python hf_ptq.py \
--model unsloth/DeepSeek-R1-0528-BF16 \
--recipe ../../modelopt_recipes/general/ptq/nvfp4_experts_only-fp8_kv.yaml \
--dataset cnn_dailymail \
--batch_size 16 \
--calib_size 64 \
--calib_seq 512 \
--gpu_max_mem_percentage 0.5 \
--use_seq_device_map \
--trust_remote_code \
--export_path output/DeepSeek-R1-BF16-nvfp4-experts-only-fp8-kv \
--vllm_fakequant_export
```
### Example: NVFP4+GPTQ layerwise calibration on Qwen3-8B (36 layers,
single GPU — 20 GB peak)
**Initial run** (killed after layer 11):
```
Layerwise calibration: Found 36 transformer layers
Calibrating layer 1/36 | capture: [1]
Computing Hessians for 7 linear layers...
GPTQ time: 51.39s
Calibrating layer 2/36 | run: [1] | capture: [2]
Checkpoint: saved layer 0
GPTQ time: 50.06s
Calibrating layer 3/36 | skip: 1 | run: [2] | capture: [3]
Checkpoint: saved layer 1
...
Calibrating layer 12/36 | skip: 10 | run: [11] | capture: [12]
Checkpoint: saved layer 10
<killed>
```
**Resumed run** (picks up from layer 11, finishes all 36):
```
Layerwise calibration: Found 36 transformer layers
Checkpoint: resuming layerwise calibration from layer 11/36
Calibrating layer 12 (resumed)
GPTQ time: 51.45s
Calibrating layer 13/36 | skip: 11 | run: [12] | capture: [13]
Checkpoint: saved layer 11
...
Calibrating layer 36/36 | skip: 34 | run: [35] | capture: [36]
Checkpoint: saved layer 34
GPTQ time: 50.33s
Checkpoint: saved layer 35 (final)
Checkpoint: restored 11 previously calibrated layers
Layerwise calibration completed
Quantized model exported to: outputs/qwen3_8b_nvfp4_gptq_seq
GPU 0: Peak memory usage = 20.42 GB
```
## TODO
- [ ] Update CHANGELOG
## Test plan
- `tests/unit/torch/quantization/test_layerwise_calibrate.py` — unit
tests for skip/swap/restore
- `tests/unit/torch/quantization/test_sequential_checkpoint.py` —
checkpoint save/resume correctness
- `tests/gpu/torch/quantization/plugins/test_accelerate_gpu.py` —
CPU-offloaded layerwise + GPTQ + checkpoint resume
- `tests/gpu/torch/quantization/test_fsdp2.py` — FSDP2 layerwise
calibration
### Verified
- [x] Qwen3-8B: layerwise calibration + checkpoint save/restore +
fakequantized checkpoint export + vLLM serve
- [x] DeepSeek-R1: checkpoint resume tested
- [x] DeepSeek-R1: fakequantized checkpoint export verified
---------
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
…ributeError on custom configs (#1324)

### What does this PR do?

Type of change: Bug fix

- Summary: Running hf_ptq.py on stepfun-ai/Step-3.5-Flash (and any model whose custom HF config doesn't assign use_cache) crashed in get_max_batch_size() with AttributeError: 'Step3p5Config' object has no attribute 'use_cache' before calibration could start.
- Extract the existing "disable KV cache during calibration" logic into a _disable_use_cache(model) context manager and apply it to both get_max_batch_size and _forward_loop. The CM sets config.use_cache = False unconditionally (not only when the attribute exists) and restores the prior value on exit if one was set.
- Behavior is unchanged for normal configs; the NemotronH hybrid-cache correctness guarantee from #1251 is preserved.

### Testing

Step-3.5-Flash PTQ now passes get_max_batch_size.

## Summary by CodeRabbit

- **Refactor**: Improved memory handling during model evaluation and calibration by consistently disabling the KV cache for both single-batch probes and full dataloader runs, simplifying the inference flow and ensuring cache state is managed reliably.
- **Tests**: Added unit tests verifying cache-state handling across models with and without cache settings, including correct restoration behavior even when errors occur.

---------

Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com>
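The `_disable_use_cache` context manager described above could be sketched like this (an assumed shape based on the PR text; in particular, deleting the attribute on exit when the config never defined it is this sketch's assumption):

```python
from contextlib import contextmanager


@contextmanager
def disable_use_cache(model):
    """Force config.use_cache = False during calibration, restoring on exit.

    Sets the attribute unconditionally (even for custom configs that never
    define use_cache) and restores the prior value on exit if one existed;
    otherwise the attribute is removed again (sketch assumption).
    """
    config = model.config
    had_attr = hasattr(config, "use_cache")
    prev = getattr(config, "use_cache", None)
    config.use_cache = False
    try:
        yield model
    finally:
        if had_attr:
            config.use_cache = prev
        else:
            del config.use_cache
```

Setting the attribute unconditionally is what fixes the Step-3.5-Flash crash: the old code only toggled `use_cache` when it already existed, so `get_max_batch_size` later read a missing attribute.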