[6110209] Patch zero FP16 scales in INT4_AWQ ONNX export #1353
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ 1 inconclusive · ✅ 5 passed (resolve all errors before merging; addressing warnings is optional)
🧹 Nitpick comments (1)

tests/unit/onnx/quantization/test_qdq_utils.py (1)

1058-1059: Consider expanding parametrization to include QuantizeLinear variants.

The production collector now supports `QuantizeLinear` and `TRT_INT4QuantizeLinear` too; adding them here would fully lock in op-type coverage for initializer-backed scales.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1058-1059, update the parametrization on the test named test_zero_scale_initializer_fed_to_dq_is_patched to also cover QuantizeLinear variants: include "QuantizeLinear" and "TRT_INT4QuantizeLinear" in the `@pytest.mark.parametrize` list (alongside "DequantizeLinear" and "TRT_INT4DequantizeLinear") so the test exercises initializer-backed scales for both quantize and dequantize op types; locate the decorator above the test function to modify the list of dq_op_type values.
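A minimal sketch of what the suggested expansion could look like; the decorator shape, the parameter name `dq_op_type`, and the existing values are assumptions about the test file rather than its actual code:

```python
import pytest

# Hypothetical parametrization; the real test may differ in naming or ordering.
@pytest.mark.parametrize(
    "dq_op_type",
    [
        "DequantizeLinear",
        "TRT_INT4DequantizeLinear",
        "QuantizeLinear",          # suggested addition
        "TRT_INT4QuantizeLinear",  # suggested addition
    ],
)
def test_zero_scale_initializer_fed_to_dq_is_patched(dq_op_type):
    ...  # existing test body, unchanged
```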
📒 Files selected for processing (3)
- CHANGELOG.rst
- modelopt/onnx/quantization/qdq_utils.py
- tests/unit/onnx/quantization/test_qdq_utils.py
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #1353      +/-   ##
==========================================
+ Coverage   75.72%   75.78%   +0.06%
==========================================
  Files         471      471
  Lines       50375    50383       +8
==========================================
+ Hits        38146    38185      +39
+ Misses      12229    12198      -31
```

☔ View full report in Codecov by Sentry.
Review comment on the CHANGELOG.rst diff (surrounding context shown):

- Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.

**Bug Fixes**
Reviewer: Doesn't this need to go in the 0.44 section below?

Author: I think we should not include this in the changelog. Removed.
replace_zero_scale_with_smallest_nonzero() in qdq_utils.py only inspected QuantizeLinear consumers and Constant-node producers, which made it a no-op for INT4_AWQ exports — those use DequantizeLinear (default and trt:: domain) consumers and store scales as graph initializers. Zero scales produced when the FP32→FP16 cast underflows therefore reached TensorRT, causing trtexec --stronglyTyped to fail with "Scale coefficients must all be positive".

Extend the sanitizer to also walk DequantizeLinear / TRT_INT4DequantizeLinear nodes and to patch initializer-backed scales, while preserving dtype. Add regression tests for both the initializer + DQ path (default and trt:: domain) and the legacy Constant + QuantizeLinear path.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
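A rough sketch of the extended sanitizer behavior described in this commit, assuming the standard onnx Python helpers; the real `replace_zero_scale_with_smallest_nonzero()` in modelopt may choose the replacement value differently and also handles the legacy Constant-node path:

```python
import numpy as np
from onnx import numpy_helper

# Op types whose second input is a scale tensor (default and trt:: domains).
QDQ_OPS = {"QuantizeLinear", "DequantizeLinear",
           "TRT_INT4QuantizeLinear", "TRT_INT4DequantizeLinear"}

def patch_zero_initializer_scales(model):
    """Replace zero entries in initializer-backed Q/DQ scales with a small positive value."""
    scale_names = {node.input[1] for node in model.graph.node
                   if node.op_type in QDQ_OPS and len(node.input) > 1}
    for init in model.graph.initializer:
        if init.name not in scale_names:
            continue
        arr = numpy_helper.to_array(init)
        if not np.issubdtype(arr.dtype, np.floating) or not (arr == 0).any():
            continue
        # Assumption: reuse the smallest nonzero magnitude already in the tensor,
        # falling back to the dtype's smallest positive normal if everything is zero.
        nonzero = np.abs(arr[arr != 0])
        smallest = nonzero.min() if nonzero.size else np.finfo(arr.dtype).tiny
        patched = np.where(arr == 0, smallest, arr).astype(arr.dtype)  # dtype preserved
        init.CopyFrom(numpy_helper.from_array(patched, init.name))
    return model
```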
Force-pushed from a3385aa to 9fc8306.
🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)
1076-1110: LGTM! Legacy Constant-node path is properly regression-tested. This ensures the existing code path continues to work after the refactoring.
Consider adding a float32 scale test case to strengthen dtype preservation coverage. The current tests only use float16 scales.
🧪 Optional: Add float32 scale test
```python
def test_float32_scale_dtype_preserved(self):
    """Verify float32 scales remain float32 after patching."""
    from modelopt.onnx.quantization.qdq_utils import replace_zero_scale_with_smallest_nonzero

    # Build model with float32 scale containing zeros
    weight_data = np.random.randint(-8, 8, size=(3, 4), dtype=np.int8)
    weight_tensor = numpy_helper.from_array(weight_data, "weight")
    scale_data = np.array([0.0, 1e-3, 0.0], dtype=np.float32).reshape(3, 1)
    scale_tensor = numpy_helper.from_array(scale_data, "scale")
    input_tensor = helper.make_tensor_value_info("input", TensorProto.FLOAT, [None, 3])
    dq_node = helper.make_node(
        "DequantizeLinear", inputs=["weight", "scale"], outputs=["dq_output"], name="dq"
    )
    graph = helper.make_graph(
        nodes=[dq_node],
        name="test_graph",
        inputs=[input_tensor],
        outputs=[helper.make_tensor_value_info("dq_output", TensorProto.FLOAT, [3, 4])],
        initializer=[weight_tensor, scale_tensor],
    )
    model = helper.make_model(graph)

    patched = replace_zero_scale_with_smallest_nonzero(model)

    scale_init = next(i for i in patched.graph.initializer if i.name == "scale")
    assert scale_init.data_type == TensorProto.FLOAT  # float32 preserved
    scale_arr = numpy_helper.to_array(scale_init)
    assert not (scale_arr == 0).any()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1076 - 1110, Add a new unit test function (e.g. test_float32_scale_dtype_preserved) in tests/unit/onnx/quantization/test_qdq_utils.py that constructs a model with a float32 scale initializer (use numpy_helper.from_array with dtype float32) consumed by a DequantizeLinear node, calls replace_zero_scale_with_smallest_nonzero(model), then locate the patched initializer by name (e.g. "scale") and assert its data_type == TensorProto.FLOAT and that the scale array contains no zeros; reference helpers used in the file such as replace_zero_scale_with_smallest_nonzero, numpy_helper.from_array, DequantizeLinear, and TensorProto.FLOAT to match existing patterns.
📒 Files selected for processing (2)
- modelopt/onnx/quantization/qdq_utils.py
- tests/unit/onnx/quantization/test_qdq_utils.py
What does this PR do?
Type of change: Bug fix
Fix `replace_zero_scale_with_smallest_nonzero()` in `modelopt/onnx/quantization/qdq_utils.py` so the FP16-scale sanitizer actually runs for INT4_AWQ ONNX exports.

The function is supposed to ensure all FP16 scales are strictly positive before the model reaches TensorRT, since `trtexec --stronglyTyped` asserts `scaleAllPositive`. It had two latent bugs that made it a complete no-op for INT4_AWQ:

- It only collected scales feeding `QuantizeLinear` consumers — but INT4_AWQ exports use `DequantizeLinear` (default domain) and `TRT_INT4DequantizeLinear` (trt:: domain). There are zero `QuantizeLinear` nodes in such graphs, so the collected set was empty.
- It only patched scales stored as `Constant` nodes — but INT4_AWQ stores scales as graph initializers.

When the FP32→FP16 cast in `_convert_fp32_init_to_fp16()` underflowed small amax values to `0.0` (FP16 min subnormal is 5.96e-8), those zeros sailed through into the exported ONNX. TRT then rejected the model with "Scale coefficients must all be positive".
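As a quick illustration of the underflow (the 2e-8 value here is just an example, not a scale taken from the model):

```python
import numpy as np

tiny_scale = np.float32(2e-8)                   # example value below half the FP16 min subnormal
print(tiny_scale.astype(np.float16))            # prints 0.0, the zero scale TensorRT later rejects
print(np.finfo(np.float16).smallest_subnormal)  # ~5.96e-08
```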
This PR extends the sanitizer to:

- Walk `QuantizeLinear` / `DequantizeLinear` / `TRT_INT4QuantizeLinear` / `TRT_INT4DequantizeLinear` nodes when collecting scale tensor names.
- Patch initializer-backed scales in addition to `Constant`-node values, preserving the original dtype.

Files modified:

- `modelopt/onnx/quantization/qdq_utils.py` — fix.
- `tests/unit/onnx/quantization/test_qdq_utils.py` — `TestReplaceZeroScaleWithSmallestNonzero` regression tests.
- `CHANGELOG.rst` — bug-fix entry under 0.45.

Usage
```bash
# Repro that previously failed and now succeeds:
python torch_quant_to_onnx.py \
  --quantize_mode=int4_awq \
  --timm_model_name=vit_base_patch16_224 \
  --onnx_save_path=/tmp/vit_base_patch16_224.int4_awq.onnx \
  --calibration_data_size=32

trtexec --onnx=/tmp/vit_base_patch16_224.int4_awq.onnx --stronglyTyped --skipInferenceTesting
```
Testing:

- `pytest tests/unit/onnx/quantization/test_qdq_utils.py::TestReplaceZeroScaleWithSmallestNonzero -v` (3 passed).
- `pytest tests/unit/onnx/quantization/test_qdq_utils.py` (25 passed).
- `pytest tests/unit/onnx/quantization/` (288 passed).
- `pre-commit run --files <changed>` clean.

Before your PR is "Ready for review"
- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: N/A

Additional Information
- Environment: `nvidia-modelopt==0.44.0rc1`, TensorRT 10.15.1.29, on B200 / H20.
- The `TRT_INT4DequantizeLinear` export path was added in PR "[OMNIML-2244] Implement the ONNX quantization exporter for INT4" #575 (commit `0a4f0a8b`); that PR didn't update the sanitizer to handle the new node type or scale storage.

Summary by CodeRabbit
Bug Fixes

- Zero FP16 scales produced during INT4_AWQ ONNX export are now patched to small positive values before the model reaches TensorRT.

Tests

- Added regression tests for initializer-backed scales feeding DequantizeLinear / TRT_INT4DequantizeLinear nodes and for the legacy Constant + QuantizeLinear path.