
[6110209] Patch zero FP16 scales in INT4_AWQ ONNX export #1353

Open

ajrasane wants to merge 1 commit into main from ajrasane/nvbug_6110209

Conversation

@ajrasane (Contributor) commented Apr 27, 2026

What does this PR do?

Type of change: Bug fix

Fix replace_zero_scale_with_smallest_nonzero() in modelopt/onnx/quantization/qdq_utils.py so the FP16-scale sanitizer actually runs for INT4_AWQ ONNX exports.

The function is supposed to ensure all FP16 scales are strictly positive before the model reaches TensorRT, since trtexec --stronglyTyped asserts scaleAllPositive. It had two latent bugs that made it a complete no-op for INT4_AWQ:

  1. It only collected scales from QuantizeLinear consumers — but INT4_AWQ exports use DequantizeLinear (default domain) and TRT_INT4DequantizeLinear (trt:: domain). There are zero QuantizeLinear nodes in such graphs, so the collected set was empty.
  2. It only patched scales emitted by Constant nodes — but INT4_AWQ stores scales as graph initializers.

When the FP32→FP16 cast in _convert_fp32_init_to_fp16() underflowed small amax values to 0.0 (FP16 min subnormal is 5.96e-8), those zeros sailed through into the exported ONNX. TRT then rejected the model with:

Assertion failed: (scaleAllPositive || allowNegativeScale): Scale coefficients must all be positive
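
For illustration only (this snippet is not part of the PR), the underflow is easy to reproduce in NumPy: any FP32 value well below FP16's smallest subnormal rounds to exactly 0.0 when cast.

import numpy as np

# FP16's smallest positive subnormal is ~5.96e-8; much smaller FP32 values round to 0.0.
amax_fp32 = np.array([1.0e-9, 1.0e-3], dtype=np.float32)
scales_fp16 = amax_fp32.astype(np.float16)
print(scales_fp16)            # first entry has underflowed to 0.0
assert scales_fp16[0] == 0.0  # exactly the value that trips TRT's scaleAllPositive check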

This PR extends the sanitizer to:

  • Walk QuantizeLinear / DequantizeLinear / TRT_INT4QuantizeLinear / TRT_INT4DequantizeLinear nodes when collecting scale tensor names.
  • Patch zero entries in float-typed graph initializers in addition to Constant-node values, preserving the original dtype (an illustrative sketch follows this list).
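
The sketch below illustrates the extended behavior described above; it is not the PR's code. The helper name _patch_zero_scales and the exact patching strategy are assumptions, and only the initializer path is shown (the real fix also patches Constant-node scales).

import numpy as np
from onnx import ModelProto, numpy_helper

# Op types whose scale input should be sanitized. The TRT_INT4* ops are assumed to
# follow the same (data, scale[, zero_point]) input layout as the standard Q/DQ ops.
_QDQ_OPS = {
    "QuantizeLinear",
    "DequantizeLinear",
    "TRT_INT4QuantizeLinear",
    "TRT_INT4DequantizeLinear",
}

def _patch_zero_scales(model: ModelProto) -> ModelProto:
    """Illustrative sketch: replace 0.0 scales with the smallest nonzero FP16 value."""
    smallest = np.finfo(np.float16).smallest_subnormal  # ~5.96e-8
    scale_names = {
        node.input[1]
        for node in model.graph.node
        if node.op_type in _QDQ_OPS and len(node.input) > 1
    }
    # Patch initializer-backed scales (the INT4_AWQ case), preserving the original dtype.
    for init in model.graph.initializer:
        if init.name not in scale_names:
            continue
        arr = numpy_helper.to_array(init)
        if np.issubdtype(arr.dtype, np.floating) and (arr == 0).any():
            patched = np.where(arr == 0, np.array(smallest, dtype=arr.dtype), arr)
            init.CopyFrom(numpy_helper.from_array(patched, init.name))
    return model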

Files modified:

  • modelopt/onnx/quantization/qdq_utils.py — fix.
  • tests/unit/onnx/quantization/test_qdq_utils.py — TestReplaceZeroScaleWithSmallestNonzero regression tests.
  • CHANGELOG.rst — bug-fix entry under 0.45.

Usage

# Repro that previously failed and now succeeds:
python torch_quant_to_onnx.py \
    --quantize_mode=int4_awq \
    --timm_model_name=vit_base_patch16_224 \
    --onnx_save_path=/tmp/vit_base_patch16_224.int4_awq.onnx \
    --calibration_data_size=32

trtexec --onnx=/tmp/vit_base_patch16_224.int4_awq.onnx --stronglyTyped --skipInference
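
As an optional sanity check (not part of the PR; the model path is taken from the repro above and the name-based filter is only a heuristic), the exported ONNX can be scanned for zero scales before handing it to trtexec:

import numpy as np
import onnx
from onnx import numpy_helper

model = onnx.load("/tmp/vit_base_patch16_224.int4_awq.onnx")
zero_scales = [
    init.name
    for init in model.graph.initializer
    if "scale" in init.name.lower()
    and np.issubdtype(numpy_helper.to_array(init).dtype, np.floating)
    and (numpy_helper.to_array(init) == 0).any()
]
print("initializers with zero scales:", zero_scales)  # expected: []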

Testing

  • New tests pass: pytest tests/unit/onnx/quantization/test_qdq_utils.py::TestReplaceZeroScaleWithSmallestNonzero -v (3 passed).
  • Full file: pytest tests/unit/onnx/quantization/test_qdq_utils.py (25 passed).
  • Broader sanity: pytest tests/unit/onnx/quantization/ (288 passed).
  • Smoke test on a synthetic model with explicit zero scales: 3 zeros → 0 zeros, all positive, FP16 dtype preserved.
  • pre-commit run --files <changed> clean.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: ✅

Additional Information

  • NVBug: 6110209
  • Reproduces with nvidia-modelopt==0.44.0rc1, TensorRT 10.15.1.29, on B200 / H20.
  • Root cause introduced when the TRT_INT4DequantizeLinear export path was added in PR #575 ([OMNIML-2244] Implement the ONNX quantization exporter for INT4, commit 0a4f0a8b); that PR didn't update the sanitizer to handle the new node type or scale storage.

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling of zero scale values in quantization operations across additional operator types and scale sources.
  • Tests

    • Added comprehensive test coverage for quantization scale handling across different operator configurations and scale sources.

copy-pr-bot Bot commented Apr 27, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai Bot (Contributor) commented Apr 27, 2026

📝 Walkthrough

The replace_zero_scale_with_smallest_nonzero function is enhanced to support scale tensors from multiple quantization/dequantization operator types (including standard and TRT INT4 variants) and from graph initializers, not just Constant nodes. The function now detects zero values across both initializer and Constant tensor representations, replacing them with the smallest nonzero fp16 value while preserving original tensor dtypes.

Changes

  • Core QDQ Scale Patching Logic (modelopt/onnx/quantization/qdq_utils.py): Expanded replace_zero_scale_with_smallest_nonzero to collect scale tensor names from multiple Q/DQ operator types, added iteration over graph initializers to detect and patch zero scales in floating-point tensors, and applied the same dtype-aware zero-replacement logic to both the initializer and Constant-node code paths.
  • QDQ Utils Tests (tests/unit/onnx/quantization/test_qdq_utils.py): Added regression test coverage for scale patching via graph initializers consumed by DequantizeLinear and TRT_INT4DequantizeLinear, and verified the existing Constant-node scale path remains functional after the patching logic updates.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 inconclusive)

  • Security Anti-Patterns (❓ Inconclusive): Unable to retrieve git diff or file contents; shell commands are not available in this environment. Resolution: provide git diff output or file contents directly for security review analysis.

✅ Passed checks (5 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): Docstring coverage is 80.00%, which meets the required threshold of 80.00%.
  • Linked Issues check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check (✅ Passed): Check skipped because no linked issues were found for this pull request.
  • Title check (✅ Passed): The title accurately identifies the main change, patching zero FP16 scales in INT4_AWQ ONNX export, which directly matches the core bug fix in the changeset.

@ajrasane ajrasane added the cherry-pick-0.44.0 label ("After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") Apr 27, 2026
@ajrasane ajrasane self-assigned this Apr 27, 2026
@ajrasane ajrasane marked this pull request as ready for review April 27, 2026 15:34
@ajrasane ajrasane requested a review from a team as a code owner April 27, 2026 15:34
@ajrasane ajrasane requested a review from cjluo-nv April 27, 2026 15:34
github-actions Bot (Contributor) commented Apr 27, 2026

PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1353/

Built to branch gh-pages at 2026-04-27 17:42 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)

1058-1059: Consider expanding parametrization to include QuantizeLinear variants.

The production collector now supports QuantizeLinear and TRT_INT4QuantizeLinear too; adding them here would fully lock in op-type coverage for initializer-backed scales.
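
A sketch of the suggested parametrization (shown standalone for brevity; the actual test is likely a method of the TestReplaceZeroScaleWithSmallestNonzero class, and its body is elided here):

import pytest

@pytest.mark.parametrize(
    "dq_op_type",
    [
        "DequantizeLinear",
        "TRT_INT4DequantizeLinear",
        # Suggested additions to also cover the quantize-side ops:
        "QuantizeLinear",
        "TRT_INT4QuantizeLinear",
    ],
)
def test_zero_scale_initializer_fed_to_dq_is_patched(dq_op_type):
    ...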

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1058 - 1059,
Update the parametrization on the test named
test_zero_scale_initializer_fed_to_dq_is_patched to also cover QuantizeLinear
variants: include "QuantizeLinear" and "TRT_INT4QuantizeLinear" in the
`@pytest.mark.parametrize` list (alongside "DequantizeLinear" and
"TRT_INT4DequantizeLinear") so the test exercises initializer-backed scales for
both quantize and dequantize op types; locate the decorator above the test
function to modify the list of dq_op_type values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 97fc3c23-f6d7-4116-a9a8-de199e6220e4

📥 Commits

Reviewing files that changed from the base of the PR and between 5b41ba4 and a3385aa.

📒 Files selected for processing (3)
  • CHANGELOG.rst
  • modelopt/onnx/quantization/qdq_utils.py
  • tests/unit/onnx/quantization/test_qdq_utils.py

codecov Bot commented Apr 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.78%. Comparing base (1ec931c) to head (9fc8306).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1353      +/-   ##
==========================================
+ Coverage   75.72%   75.78%   +0.06%     
==========================================
  Files         471      471              
  Lines       50375    50383       +8     
==========================================
+ Hits        38146    38185      +39     
+ Misses      12229    12198      -31     
Flag Coverage Δ
examples 41.61% <100.00%> (+0.99%) ⬆️
gpu 58.40% <33.33%> (-0.51%) ⬇️
unit 52.77% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Comment thread on CHANGELOG.rst (Outdated)

- Add offline DFlash speculative decoding training. Train the draft module from pre-computed base-model hidden states dumped by ``examples/speculative_decoding/collect_hidden_states/compute_hidden_states_hf.py``; base-model transformer layers are deleted after conversion to save memory. Controlled by the auto-derived ``dflash_offline`` flag on ``DFlashConfig`` (derived from ``data_args.offline_data_path``). The dump scripts now share ``collect_hidden_states/common.py`` for aux-layer selection (``--aux-layers eagle|dflash|<list>``) and optional assistant-token ``loss_mask`` for answer-only-loss training.

**Bug Fixes**
Collaborator:

Doesn't this need to go in the 0.44 section below?

Contributor Author:

I think we should not include this in the changelog. Removed.

replace_zero_scale_with_smallest_nonzero() in qdq_utils.py only inspected
QuantizeLinear consumers and Constant-node producers, which made it a
no-op for INT4_AWQ exports — those use DequantizeLinear (default and
trt:: domain) consumers and store scales as graph initializers. Zero
scales produced when the FP32→FP16 cast underflows therefore reached
TensorRT, causing trtexec --stronglyTyped to fail with "Scale
coefficients must all be positive".

Extend the sanitizer to also walk DequantizeLinear / TRT_INT4DequantizeLinear
nodes and to patch initializer-backed scales, while preserving dtype.

Add regression tests for both the initializer + DQ path (default and
trt:: domain) and the legacy Constant + QuantizeLinear path.

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
@ajrasane ajrasane force-pushed the ajrasane/nvbug_6110209 branch from a3385aa to 9fc8306 on April 27, 2026 17:39
@ajrasane ajrasane changed the title [Fix]: Patch zero FP16 scales in INT4_AWQ ONNX export (NVBug 6110209) [6110209] Patch zero FP16 scales in INT4_AWQ ONNX export Apr 27, 2026
@coderabbitai coderabbitai Bot (Contributor) left a comment

🧹 Nitpick comments (1)
tests/unit/onnx/quantization/test_qdq_utils.py (1)

1076-1110: LGTM! Legacy Constant-node path is properly regression-tested.

This ensures the existing code path continues to work after the refactoring.

Consider adding a float32 scale test case to strengthen dtype preservation coverage. The current tests only use float16 scales.

🧪 Optional: Add float32 scale test
def test_float32_scale_dtype_preserved(self):
    """Verify float32 scales remain float32 after patching."""
    # In the actual test file these would come from the module-level imports.
    import numpy as np
    from onnx import TensorProto, helper, numpy_helper

    from modelopt.onnx.quantization.qdq_utils import replace_zero_scale_with_smallest_nonzero

    # Build model with float32 scale containing zeros
    weight_data = np.random.randint(-8, 8, size=(3, 4), dtype=np.int8)
    weight_tensor = numpy_helper.from_array(weight_data, "weight")
    scale_data = np.array([0.0, 1e-3, 0.0], dtype=np.float32).reshape(3, 1)
    scale_tensor = numpy_helper.from_array(scale_data, "scale")

    input_tensor = helper.make_tensor_value_info("input", TensorProto.FLOAT, [None, 3])
    dq_node = helper.make_node(
        "DequantizeLinear", inputs=["weight", "scale"], outputs=["dq_output"], name="dq"
    )
    graph = helper.make_graph(
        nodes=[dq_node],
        name="test_graph",
        inputs=[input_tensor],
        outputs=[helper.make_tensor_value_info("dq_output", TensorProto.FLOAT, [3, 4])],
        initializer=[weight_tensor, scale_tensor],
    )
    model = helper.make_model(graph)

    patched = replace_zero_scale_with_smallest_nonzero(model)

    scale_init = next(i for i in patched.graph.initializer if i.name == "scale")
    assert scale_init.data_type == TensorProto.FLOAT  # float32 preserved
    scale_arr = numpy_helper.to_array(scale_init)
    assert not (scale_arr == 0).any()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/test_qdq_utils.py` around lines 1076 - 1110, Add
a new unit test function (e.g. test_float32_scale_dtype_preserved) in
tests/unit/onnx/quantization/test_qdq_utils.py that constructs a model with a
float32 scale initializer (use numpy_helper.from_array with dtype float32)
consumed by a DequantizeLinear node, calls
replace_zero_scale_with_smallest_nonzero(model), then locate the patched
initializer by name (e.g. "scale") and assert its data_type == TensorProto.FLOAT
and that the scale array contains no zeros; reference helpers used in the file
such as replace_zero_scale_with_smallest_nonzero, numpy_helper.from_array,
DequantizeLinear, and TensorProto.FLOAT to match existing patterns.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 08f6d273-b45c-4536-9b8a-416dea5d5508

📥 Commits

Reviewing files that changed from the base of the PR and between a3385aa and 9fc8306.

📒 Files selected for processing (2)
  • modelopt/onnx/quantization/qdq_utils.py
  • tests/unit/onnx/quantization/test_qdq_utils.py
