Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes.
🧹 Nitpick comments (2)
modelopt/torch/quantization/model_calib.py (2)
1658-1669: `cache.reset()` may be redundant after forcing `past_key_values=None`; add justification/defensive handling.

In the replay loop you:

- copy `kwargs_input`
- call `cache.reset()` (if present)
- then set `kwargs_input["past_key_values"] = None`

Since the cache object is not passed into the layer after setting it to `None`, `reset()` is only useful if you're relying on it for buffer cleanup/releasing resources before dropping the reference. If that's the intent, a brief comment would help future readers.

If `reset()` can throw for some cache implementations, consider guarding it (minor hardening) so calibration doesn't fail unexpectedly on a cache type that exposes `reset` but can't safely run it:

Optional hardening diff

```diff
 cache = kwargs_input["past_key_values"]
-if cache is not None and hasattr(cache, "reset"):
-    cache.reset()
+if cache is not None and hasattr(cache, "reset"):
+    try:
+        cache.reset()
+    except Exception:
+        # Best-effort cleanup; we will force past_key_values=None anyway.
+        pass
 kwargs_input["past_key_values"] = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1658-1669: The loop clears past_key_values by setting kwargs_input["past_key_values"]=None but also calls cache.reset(), which is either redundant or risky; either remove the reset() call and add a short comment noting we drop the cache by setting past_key_values to None, or harden it by wrapping cache.reset() in a try/except (logging or ignoring exceptions) and add a comment explaining it's defensive cleanup before dropping the reference; update the replay loop around _inputs, kwargs_input, cache, and the call site m(*args, **kwargs_input) accordingly so behavior is explicit and safe.
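For orientation, here is a minimal, self-contained sketch of the replay pattern this comment describes, with the guarded `reset()` applied. The names (`replay_layer`, `m`, `_inputs`) are assumptions taken from the prompt above, not the actual code in `model_calib.py`:

```python
import copy

def replay_layer(m, _inputs):
    """Hypothetical replay loop; names mirror the prompt above, not the real code."""
    for args, kwargs in _inputs:
        # Shallow-copy so the stored kwargs survive for later replays.
        kwargs_input = copy.copy(kwargs)
        cache = kwargs_input.get("past_key_values")
        if cache is not None and hasattr(cache, "reset"):
            try:
                # Best-effort buffer cleanup before dropping the reference;
                # guarded because some cache types may not support reset().
                cache.reset()
            except Exception:
                pass
        # The layer is invoked without a KV cache regardless.
        kwargs_input["past_key_values"] = None
        m(*args, **kwargs_input)
```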
1646-1653: OK to skip bootstrapping when no layers remain, but consider an early-exit optimization.

With `start_layer >= num_layers`, you set `layer_inputs = None` and the per-layer loop becomes a no-op. That's logically consistent.

Optional: you could short-circuit earlier (after patching/unpatching strategy review) to avoid unnecessary `input_getter._patch_all_layers(...)` work when there's nothing to replay/capture. Not required for correctness, but it can reduce overhead for fully-calibrated restarts; see the sketch after the prompt below.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/quantization/model_calib.py` around lines 1646-1653: The per-layer loop is a no-op when start_layer >= num_layers, but you still call input_getter._patch_all_layers(...) and do bootstrapping work; add an early exit to skip bootstrapping and patch/unpatch steps when start_layer >= num_layers to avoid unnecessary overhead: check the start_layer >= num_layers condition before calling input_getter._patch_all_layers (and before any bootstrapping/patching logic) and return or bypass the rest of the replay/capture flow so layer_inputs/resumed_inputs/forward_loop work is skipped for fully-calibrated restarts.
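The suggested early exit could look roughly like the following. Treat it as a sketch: the driver function and the helper names not mentioned above (notably `_unpatch_all_layers`) are assumptions, not the real module's API.

```python
def run_layerwise_replay(input_getter, transformer_layers, start_layer, forward_loop):
    """Hypothetical driver showing the suggested short-circuit."""
    num_layers = len(transformer_layers)
    if start_layer >= num_layers:
        # Fully-calibrated restart: nothing to replay or capture,
        # so skip patching and bootstrapping entirely.
        return
    input_getter._patch_all_layers(transformer_layers)
    try:
        layer_inputs = input_getter.get_first_layer_inputs(
            start_layer, None, forward_loop
        )
        # ... per-layer replay/capture loop over transformer_layers[start_layer:] ...
    finally:
        # Assumed counterpart to _patch_all_layers; the real cleanup hook may differ.
        input_getter._unpatch_all_layers(transformer_layers)
```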
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around lines 1658-1669: The loop clears past_key_values by setting
kwargs_input["past_key_values"]=None but also calls cache.reset(), which is
either redundant or risky; either remove the reset() call and add a short
comment noting we drop the cache by setting past_key_values to None, or harden
it by wrapping cache.reset() in a try/except (logging or ignoring exceptions)
and add a comment explaining it's defensive cleanup before dropping the
reference; update the replay loop around _inputs, kwargs_input, cache, and the
call site m(*args, **kwargs_input) accordingly so behavior is explicit and safe.
- Around lines 1646-1653: The per-layer loop is a no-op when start_layer >=
num_layers, but you still call input_getter._patch_all_layers(...) and do
bootstrapping work; add an early exit to skip bootstrapping and patch/unpatch
steps when start_layer >= num_layers to avoid unnecessary overhead: check the
start_layer >= num_layers condition before calling
input_getter._patch_all_layers (and before any bootstrapping/patching logic) and
return or bypass the rest of the replay/capture flow so
layer_inputs/resumed_inputs/forward_loop work is skipped for fully-calibrated
restarts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f912c3b9-748a-4790-92cf-c62b7dc38086
📒 Files selected for processing (2)
examples/llm_ptq/example_utils.py
modelopt/torch/quantization/model_calib.py
Codecov Report

❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1344      +/-   ##
==========================================
+ Coverage   66.36%   75.59%   +9.22%
==========================================
  Files         471      471
  Lines       50510    50512       +2
==========================================
+ Hits        33522    38183    +4661
+ Misses      16988    12329    -4659
```
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Signed-off-by: Fridah-nv <201670829+Fridah-nv@users.noreply.github.com>
Force-pushed from b772ab9 to e0ab32e.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/llm_ptq/example_utils.py`:
- Around lines 865-871: The hash currently drops all non-scalar fields from
algorithm, risking collisions; instead, build alg_for_hash from algorithm (still
skipping layerwise_checkpoint_dir) but for non-JSON-scalars include a
deterministic placeholder that captures their identity — e.g., replace each
non-scalar value with a small stable descriptor like its type name plus a short
hash of repr(value) (or a deterministic serialization), then compute config_hash
from json.dumps(alg_for_hash, sort_keys=True). This preserves uniqueness for
nested/complex fields while keeping the hash deterministic and still excluding
layerwise_checkpoint_dir.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 23011c25-fd14-4b57-b58c-ed28d03dbc57
📒 Files selected for processing (2)
examples/llm_ptq/example_utils.py
modelopt/torch/quantization/model_calib.py
```python
# Hash only the algorithm dict (scalar fields, fully JSON-serializable and deterministic).
alg_for_hash = {
    k: v
    for k, v in sorted(algorithm.items())
    if k != "layerwise_checkpoint_dir" and isinstance(v, (str, int, float, bool, type(None)))
}
config_hash = hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]
```
Avoid silent checkpoint-hash collisions from dropped algorithm fields.
The current hash input excludes all non-scalar values, so different algorithm configs can collapse to the same suffix and accidentally reuse incompatible layerwise checkpoints.
💡 Proposed fix
```diff
-    # Hash only the algorithm dict (scalar fields, fully JSON-serializable and deterministic).
-    alg_for_hash = {
-        k: v
-        for k, v in sorted(algorithm.items())
-        if k != "layerwise_checkpoint_dir" and isinstance(v, (str, int, float, bool, type(None)))
-    }
-    config_hash = hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]
+    def _stable_jsonable(obj):
+        if isinstance(obj, dict):
+            return {k: _stable_jsonable(v) for k, v in sorted(obj.items())}
+        if isinstance(obj, (list, tuple)):
+            return [_stable_jsonable(v) for v in obj]
+        if isinstance(obj, (str, int, float, bool, type(None))):
+            return obj
+        raise TypeError(
+            f"Unsupported algorithm field type for checkpoint hash: {type(obj).__name__}"
+        )
+
+    alg_for_hash = _stable_jsonable(
+        {k: v for k, v in algorithm.items() if k != "layerwise_checkpoint_dir"}
+    )
+    config_hash = hashlib.sha256(
+        json.dumps(alg_for_hash, sort_keys=True, separators=(",", ":")).encode()
+    ).hexdigest()[:8]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/llm_ptq/example_utils.py` around lines 865-871: The hash currently
drops all non-scalar fields from algorithm, risking collisions; instead, build
alg_for_hash from algorithm (still skipping layerwise_checkpoint_dir) but for
non-JSON-scalars include a deterministic placeholder that captures their
identity — e.g., replace each non-scalar value with a small stable descriptor
like its type name plus a short hash of repr(value) (or a deterministic
serialization), then compute config_hash from json.dumps(alg_for_hash,
sort_keys=True). This preserves uniqueness for nested/complex fields while
keeping the hash deterministic and still excluding layerwise_checkpoint_dir.
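To make the collision concrete, here is a small self-contained demonstration of the failure mode. The configs are hypothetical; only the filtering expression mirrors the snippet under review:

```python
import hashlib
import json

def scalar_only_hash(algorithm: dict) -> str:
    # Same filter as the code under review: non-scalar fields are silently dropped.
    alg_for_hash = {
        k: v
        for k, v in sorted(algorithm.items())
        if k != "layerwise_checkpoint_dir"
        and isinstance(v, (str, int, float, bool, type(None)))
    }
    return hashlib.sha256(json.dumps(alg_for_hash, sort_keys=True).encode()).hexdigest()[:8]

# Two hypothetical algorithm configs that differ only in a nested dict...
cfg_a = {"method": "awq_lite", "alpha_step": 0.1, "search": {"grid": 16}}
cfg_b = {"method": "awq_lite", "alpha_step": 0.1, "search": {"grid": 64}}

# ...collapse to the same checkpoint suffix, so an incompatible layerwise
# checkpoint directory could be silently reused on resume.
assert scalar_only_hash(cfg_a) == scalar_only_hash(cfg_b)
```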
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/quantization/model_calib.py`:
- Around lines 1651-1664: The current fast-path relying on start_layer ==
num_layers is unreachable because _CheckpointState.from_folder() resets
start_layer to 0 when detect_resume_point() returns None; fix this by having the
checkpoint helper indicate completion instead of hiding it: update
_CheckpointState.from_folder (and detect_resume_point) to set start_layer =
num_layers (or expose a completed flag like ckpt.completed) when calibration is
already complete, and then use that value/flag here (the start_layer check in
model_calib.py that decides whether to call ckpt.setup_resume and
input_getter.get_first_layer_inputs). Reference symbols to change:
detect_resume_point, _CheckpointState.from_folder, _CheckpointState (add
completed or set start_layer=num_layers), ckpt.setup_resume, start_layer, and
input_getter.get_first_layer_inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b57a7734-9fee-4cd2-ac2c-ff625b8bb8ed
📒 Files selected for processing (1)
modelopt/torch/quantization/model_calib.py
```diff
     # When all layers are already done (start_layer == num_layers), skip input setup:
     resumed_inputs = (
         ckpt.setup_resume(transformer_layers) if ckpt and 0 < start_layer < num_layers else None
     )

     try:
         # Bootstrap: get first layer's inputs (or use resumed inputs).
-        layer_inputs = input_getter.get_first_layer_inputs(
-            start_layer, resumed_inputs, forward_loop
-        )
+        # Skip entirely when all layers are already calibrated (start_layer == num_layers).
+        if start_layer < num_layers:
+            layer_inputs = input_getter.get_first_layer_inputs(
+                start_layer, resumed_inputs, forward_loop
+            )
+        else:
+            layer_inputs = None
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
set -euo pipefail
echo "Inspect detect_resume_point completion behavior:"
rg -n -C4 'def detect_resume_point|last \+ 1 >= total|return None' modelopt/torch/quantization/utils/layerwise_calib.py
echo
echo "Inspect _CheckpointState.from_folder start_layer assignment:"
rg -n -C6 'def from_folder|detect_resume_point|start =|return cls\(.*start_layer' modelopt/torch/quantization/utils/layerwise_calib.py
```

Repository: NVIDIA/Model-Optimizer
Length of output: 5411
🏁 Script executed:
```bash
sed -n '1640,1680p' modelopt/torch/quantization/model_calib.py | cat -n
```

Repository: NVIDIA/Model-Optimizer
Length of output: 2211
🏁 Script executed:
```bash
# Double-check: is there any other code path that could set start_layer to num_layers?
rg -n 'start_layer.*=.*num_layers|start_layer.*=.*total' modelopt/torch/quantization/
```

Repository: NVIDIA/Model-Optimizer
Length of output: 340
The start_layer == num_layers fast-path is unreachable due to checkpoint helper behavior.
The code at lines 1651 and 1659 is designed to skip bootstrap when all layers are already calibrated (start_layer == num_layers). However, detect_resume_point() returns None when calibration is complete, and _CheckpointState.from_folder() then sets start_layer = 0 (line 570 of layerwise_calib.py). This means resuming from a complete checkpoint will always have start_layer == 0, causing the bootstrap to execute unnecessarily.
The checkpoint helper should either set start_layer = num_layers when complete, or the code should check for completion via a dedicated flag instead of relying on start_layer == num_layers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/torch/quantization/model_calib.py` around lines 1651-1664: The
current fast-path relying on start_layer == num_layers is unreachable because
_CheckpointState.from_folder() resets start_layer to 0 when
detect_resume_point() returns None; fix this by having the checkpoint helper
indicate completion instead of hiding it: update _CheckpointState.from_folder
(and detect_resume_point) to set start_layer = num_layers (or expose a completed
flag like ckpt.completed) when calibration is already complete, and then use
that value/flag here (the start_layer check in model_calib.py that decides
whether to call ckpt.setup_resume and input_getter.get_first_layer_inputs).
Reference symbols to change: detect_resume_point, _CheckpointState.from_folder,
_CheckpointState (add completed or set start_layer=num_layers),
ckpt.setup_resume, start_layer, and input_getter.get_first_layer_inputs.
What does this PR do?
Type of change: ?
Usage
```python
# Add a code snippet demonstrating how to use this
```

Testing
Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ / ❌ / N/A

Additional Information
Summary by CodeRabbit