[OMNIML-3252][ONNX] Add real Q/DQ scales in Autotune #951

gcunhase merged 43 commits into NVIDIA:main
Conversation
> **Note:** Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. This behavior can be configured in the CodeRabbit settings.
📝 Walkthrough

Adds ONNX quantization autotune: CLI presets and flags, autotune workflows and helpers, integration of autotune results into FP8/INT8 quantizers and ORT configuration, defensive graph rewiring, an activation-op helper, and tests plus test-model helpers.
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI
    participant Entry as quantize CLI
    participant Finder as _find_nodes_to_quantize_autotune
    participant Autotune as Autotune Workflow
    participant Quantizer as FP8/INT8 Quantizer
    participant ORT as ORT Configurator
    participant Model as ONNX Model
    CLI->>Entry: invoke quantize(..., autotune=True)
    Entry->>Entry: apply_mode_presets(args)
    Entry->>Finder: _find_nodes_to_quantize_autotune(onnx_model,...)
    Finder->>Autotune: region_pattern_autotuning_workflow(model_or_path, output_dir?)
    Autotune->>Model: analyze regions & patterns
    Autotune-->>Finder: resolved insertion points & configs
    Finder->>Finder: get_resolved_insertion_points(best=True)
    Finder->>Finder: get_ort_quantization_config()
    Finder-->>Entry: nodes_to_quantize, no_quantize_inputs, op_types_needing_output_quant
    Entry->>Quantizer: quantize(nodes_to_quantize,..., autotune=True)
    Quantizer->>ORT: configure_ort(op_types_needing_output_quant)
    ORT-->>Quantizer: configuration applied
    Quantizer->>Model: apply quantization (skip pattern expansion for autotune)
    Quantizer-->>Entry: quantized_model
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 4 passed
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/unit/onnx/quantization/autotune/test_region.py (1)
Lines 16-21: ⚠️ Potential issue | 🟡 Minor — Remove duplicate license text block.

Lines 16-21 duplicate the license disclaimer already present in lines 10-14. This appears to be a copy-paste error.

🔧 Proposed fix

```diff
 # limitations under the License.
-
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
 """Tests for the Region class in the autotuner."""
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tests/unit/onnx/quantization/autotune/test_region.py` around lines 16-21: remove the duplicated license disclaimer block that was accidentally copy-pasted, keeping only the original license header at the top of the file and leaving all other content unchanged.

modelopt/onnx/quantization/fp8.py (1)
Lines 219-232: ⚠️ Potential issue | 🟠 Major — Potential `AttributeError` if `nodes_to_exclude` is `None`.

Same issue as in `int8.py`: line 232 calls `nodes_to_exclude.extend()` before validation on line 236. If `nodes_to_exclude` is passed as `None`, this will fail.

🐛 Proposed fix

```diff
 enable_gemv_detection_for_trt = kwargs.get("enable_gemv_detection_for_trt", True)
+nodes_to_exclude = nodes_to_exclude or []
 if enable_gemv_detection_for_trt and not autotune:
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `modelopt/onnx/quantization/fp8.py` around lines 219-232: the block that calls `nodes_to_exclude.extend(...)` when `enable_gemv_detection_for_trt and not autotune` can raise `AttributeError` if `nodes_to_exclude` is `None`. Initialize `nodes_to_exclude` to a list before calling `find_nodes_from_matmul_to_exclude` and extending, so `extend` is only ever called on a list.

modelopt/onnx/quantization/int8.py (1)
Lines 161-174: ⚠️ Potential issue | 🟠 Major — Potential `AttributeError` if `nodes_to_exclude` is `None`.

When `enable_gemv_detection_for_trt` is `True` and `autotune` is `False`, line 174 calls `nodes_to_exclude.extend()` before `nodes_to_exclude` is validated/converted by `find_nodes_to_exclude()` on line 178. If `nodes_to_exclude` is passed as `None`, this will raise an `AttributeError`.

🐛 Proposed fix

```diff
 enable_gemv_detection_for_trt = kwargs.get("enable_gemv_detection_for_trt", True)
+nodes_to_exclude = nodes_to_exclude or []
 if enable_gemv_detection_for_trt and not autotune:
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `modelopt/onnx/quantization/int8.py` around lines 161-174: ensure `nodes_to_exclude` is a list before extending — either initialize it when `None` (e.g., `nodes_to_exclude = nodes_to_exclude or []`) or call/normalize via `find_nodes_to_exclude()` earlier, so `extend` is only ever called on a list.
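The fp8.py and int8.py findings describe the same defect. A minimal sketch of the suggested normalization — the surrounding function and the `gemv_nodes` stand-in are hypothetical; only the parameter names follow the review:

```python
def select_nodes_to_exclude(nodes_to_exclude=None,
                            enable_gemv_detection_for_trt=True,
                            autotune=False):
    # Normalize first: a None default plus `or []` guarantees that
    # .extend() is only ever called on a list, avoiding the AttributeError.
    nodes_to_exclude = nodes_to_exclude or []
    if enable_gemv_detection_for_trt and not autotune:
        # Stand-in for find_nodes_from_matmul_to_exclude(...)
        gemv_nodes = ["MatMul_3"]
        nodes_to_exclude.extend(gemv_nodes)
    return nodes_to_exclude
```

With the `or []` guard in place, callers may pass `None`, an empty list, or a pre-populated list interchangeably.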
🧹 Nitpick comments (3)
modelopt/onnx/quantization/quantize.py (1)
Lines 272-274: Filename replacement may fail with edge-case paths.

Using `onnx_path.replace(".onnx", ".quant_autotune.onnx")` could produce unexpected results if ".onnx" appears elsewhere in the path (e.g., `/models/onnx.models/model.onnx`).

💡 Safer alternative using path manipulation

```diff
+import os
 # Export model with Q/DQ insertion
-onnx_path_autotune = onnx_path.replace(".onnx", ".quant_autotune.onnx")
+base, ext = os.path.splitext(onnx_path)
+onnx_path_autotune = f"{base}.quant_autotune{ext}"
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `modelopt/onnx/quantization/quantize.py` around lines 272-274: compute `onnx_path_autotune` with proper path/suffix manipulation (e.g., `os.path.splitext` or `pathlib`) instead of `str.replace`, before calling `autotuner.export_onnx` and appending to `intermediate_generated_files`.

modelopt/onnx/utils.py (1)
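Sketching the quantize.py filename nitpick above before moving on: `os.path.splitext` splits off only the final extension, so ".onnx" substrings earlier in the path are untouched, while `str.replace` rewrites every occurrence. The helper name is illustrative:

```python
import os

def autotune_output_path(onnx_path: str) -> str:
    # splitext splits off only the last extension, so ".onnx" appearing
    # earlier in the path (e.g. in a directory name) is left alone.
    base, ext = os.path.splitext(onnx_path)
    return f"{base}.quant_autotune{ext}"
```

For a path like `/data/model.onnx.d/model.onnx`, `str.replace` would also rewrite the directory name, while the splitext version only touches the trailing extension.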
Lines 175-191: Potential `IndexError` if input/output lists contain unexpected elements.

The list comprehension assumes `inp.inputs[0]` and `out.outputs[0]` exist when `inp.inputs`/`out.outputs` are truthy. While graphsurgeon typically ensures non-empty lists here, adding explicit length checks would make this more robust.

🛡️ Proposed defensive fix

```diff
 return [
     node
     for node in graph.nodes
-    if any(inp.inputs[0].op == "DequantizeLinear" for inp in node.inputs if inp.inputs)
-    or any(out.outputs[0].op == "QuantizeLinear" for out in node.outputs if out.outputs)
+    if any(
+        len(inp.inputs) > 0 and inp.inputs[0].op == "DequantizeLinear"
+        for inp in node.inputs
+        if inp.inputs
+    )
+    or any(
+        len(out.outputs) > 0 and out.outputs[0].op == "QuantizeLinear"
+        for out in node.outputs
+        if out.outputs
+    )
 ]
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `modelopt/onnx/utils.py` around lines 175-191: the comprehension in `get_quantized_nodes` assumes `inp.inputs[0]` and `out.outputs[0]` exist and can raise `IndexError`; change the two `any()` guards to check length before indexing, so the `.op` comparisons only run when the lists have at least one element.

modelopt/onnx/quantization/autotune/workflows.py (1)
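A runnable illustration of the utils.py guard above, using hypothetical stand-in classes rather than real graphsurgeon objects (attribute names mirror the review; everything else is invented for the example):

```python
class Tensor:
    """Hypothetical stand-in for a graphsurgeon tensor."""
    def __init__(self, producers=None, consumers=None):
        self.inputs = producers or []    # nodes that produce this tensor
        self.outputs = consumers or []   # nodes that consume this tensor

class Node:
    """Hypothetical stand-in for a graphsurgeon node."""
    def __init__(self, op, inputs=(), outputs=()):
        self.op = op
        self.inputs = list(inputs)
        self.outputs = list(outputs)

def is_quantized(node):
    # Confirm each list is non-empty before indexing element 0; an empty
    # producer/consumer list short-circuits the `and` instead of raising.
    return any(inp.inputs and inp.inputs[0].op == "DequantizeLinear"
               for inp in node.inputs) or \
           any(out.outputs and out.outputs[0].op == "QuantizeLinear"
               for out in node.outputs)
```

A `Conv` fed by a `DequantizeLinear` producer is detected even when its other inputs have no producers at all.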
Lines 202-203: Docstring should document temp directory behavior.

The docstring for `output_dir` doesn't mention that when `None` is provided, a temporary directory is automatically created via `tempfile.mkdtemp()`. This is important for API consumers to understand, especially since temp directories may accumulate if `keep_output_dir=True` (the default).

📝 Suggested docstring update

```diff
-output_dir: Directory for output files (state, logs, models). Created if it doesn't exist.
+output_dir: Directory for output files (state, logs, models). Created if it doesn't exist.
+    If None, a temporary directory is created via tempfile.mkdtemp().
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `modelopt/onnx/quantization/autotune/workflows.py` around lines 202-203: update the docstring for the `output_dir` parameter to state that when `output_dir` is `None` a temporary directory is created via `tempfile.mkdtemp()`, and note that this directory is retained when `keep_output_dir=True` (the default), so callers may need to remove it to avoid accumulation.
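A sketch of the directory-resolution behavior the docstring should describe — the helper name and `autotune_` prefix are invented for illustration; only the `output_dir`/`tempfile.mkdtemp()` semantics come from the review:

```python
import tempfile
from pathlib import Path

def resolve_output_dir(output_dir=None):
    """Return the directory used for autotune artifacts.

    output_dir: Directory for output files. Created if it doesn't exist.
        If None, a temporary directory is created via tempfile.mkdtemp();
        it is retained when keep_output_dir=True (the default), so callers
        may need to remove it themselves to avoid accumulation.
    """
    if output_dir is None:
        return Path(tempfile.mkdtemp(prefix="autotune_"))
    path = Path(output_dir)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Documenting the `None` branch matters precisely because `mkdtemp` (unlike `TemporaryDirectory`) never cleans up after itself.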
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/onnx/quantization/autotune/workflows.py`:
- Around line 386-390: The log message in the cleanup branch is inverted: inside
the if not keep_output_dir block (where shutil.rmtree(output_dir) is called)
update the logger.debug message to tell users to set keep_output_dir=True to
retain the directory; specifically modify the message emitted by logger.debug
near the removal call that references output_dir and keep_output_dir so it
correctly reads that setting keep_output_dir=True will keep the directory.
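The corrected branch/message pairing might look like this sketch (logger name and exact wording are illustrative, not the project's actual code):

```python
import logging
import shutil

logger = logging.getLogger("autotune")

def cleanup_output_dir(output_dir, keep_output_dir=True):
    if not keep_output_dir:
        # We are about to delete, so the hint must say how to KEEP the
        # directory -- the inverted message told users the opposite.
        logger.debug("Removing %s; pass keep_output_dir=True to retain it.",
                     output_dir)
        shutil.rmtree(output_dir)
    else:
        logger.debug("Keeping autotune output directory %s.", output_dir)
```

Keeping the hint inside the deleting branch ties the message to the action actually being taken.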
In `@modelopt/onnx/quantization/quantize.py`:
- Around line 246-253: The function _find_nodes_to_quantize_autotune uses a
mutable default for intermediate_generated_files (list[str] = []); change the
signature to use None as the default (intermediate_generated_files:
Optional[list[str]] = None) and inside the function, if
intermediate_generated_files is None then set intermediate_generated_files = []
so each call gets a fresh list; update any type hints/imports if needed and
ensure all code in _find_nodes_to_quantize_autotune that appends or inspects
intermediate_generated_files works with the new initialization.
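The hazard being fixed here is Python's shared mutable default argument; a minimal sketch (function bodies invented for illustration, only the parameter name follows the review):

```python
def find_nodes_buggy(intermediate_generated_files=[]):
    # Anti-pattern: this single list is created once at definition time
    # and shared by every call that omits the argument.
    intermediate_generated_files.append("model.quant_autotune.onnx")
    return intermediate_generated_files

def find_nodes_fixed(intermediate_generated_files=None):
    # Fix: default to None, then create a fresh list per call.
    if intermediate_generated_files is None:
        intermediate_generated_files = []
    intermediate_generated_files.append("model.quant_autotune.onnx")
    return intermediate_generated_files
```

The buggy version accumulates entries across unrelated calls; the fixed version starts clean every time.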
In `@setup.py`:
- Line 62: The dependency entry for "cuda-python" in setup.py lacks a version
constraint and the inline comment "For autotune" is misleading; change the
dependency to include a minimum version compatible with your CUDA/driver/ONNX
Runtime stack (e.g., "cuda-python>=13.0") and update the comment to accurately
state its purpose (e.g., "CUDA Python bindings for GPU/driver interactions -
ensure matches CUDA/ONNX Runtime version"). Ensure this follows the same pinning
style as other dependencies like "onnxslim>=0.1.76" and "polygraphy>=0.49.22".
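The pinning style this comment asks for could look like the fragment below. The `>=13.0` floor is the reviewer's example, not a verified compatibility bound, and the list itself is hypothetical:

```python
# Hypothetical extras fragment mirroring the existing pinning style.
onnx_optional_deps = [
    "onnxslim>=0.1.76",
    "polygraphy>=0.49.22",
    # CUDA Python bindings for GPU/driver interactions; keep the floor in
    # sync with the supported CUDA / ONNX Runtime stack.
    "cuda-python>=13.0",
]
```

An unconstrained entry would let pip resolve any release, including one ABI-incompatible with the installed CUDA driver.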
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📥 Commits
Reviewing files that changed from the base of the PR and between 0f668a3 and 52b9c31eba63d51a6169c4b6718db227ae3ab4a0.
📒 Files selected for processing (11)
- modelopt/onnx/op_types.py
- modelopt/onnx/quantization/__main__.py
- modelopt/onnx/quantization/autotune/workflows.py
- modelopt/onnx/quantization/fp8.py
- modelopt/onnx/quantization/graph_utils.py
- modelopt/onnx/quantization/int8.py
- modelopt/onnx/quantization/ort_utils.py
- modelopt/onnx/quantization/quantize.py
- modelopt/onnx/utils.py
- setup.py
- tests/unit/onnx/quantization/autotune/test_region.py
Overall Assessment
This PR is well-structured and achieves its stated goals of:
- Integrating Auto Q/DQ into the ONNX quantization workflow
- Enabling calibration data to obtain correct scales for Q/DQ nodes
The changes are substantial but well-organized across multiple files. Below are my detailed review comments.
Pull request overview
Integrates ONNX Auto Q/DQ (TensorRT-driven autotuning) into the existing ONNX quantization workflow so Q/DQ placement can be derived from TensorRT profiling and then calibrated to produce real (non-random) Q/DQ scales.
Changes:
- Added an `--autotune` flag (and `autotune` plumbing) to route INT8/FP8 quantization through the Auto Q/DQ placement workflow.
- Introduced utilities to detect "quantized nodes" from a Q/DQ-inserted model and used this to drive node selection + ORT configuration tweaks (output quantization for certain producers).
- Updated the autotune workflow API to accept in-memory models and optionally auto-manage its output directory.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/onnx/quantization/autotune/test_region.py | Updates file header metadata. |
| setup.py | Adds cuda-python to ONNX optional dependencies to support TensorRT Python autotune benchmarking. |
| modelopt/onnx/utils.py | Adds get_quantized_nodes() helper for extracting quantized nodes from a Q/DQ graph. |
| modelopt/onnx/quantization/quantize.py | Adds autotune flag, integrates Auto Q/DQ placement, and feeds results into INT8/FP8 quantizers. |
| modelopt/onnx/quantization/ort_utils.py | Extends ORT configuration to optionally allow output quantization for selected op types. |
| modelopt/onnx/quantization/int8.py | Adds autotune plumbing and bypasses some default heuristics when autotune is enabled. |
| modelopt/onnx/quantization/graph_utils.py | Fixes partial-input Q/DQ removal to patch the intended consumer branch (shared Q/DQ case). |
| modelopt/onnx/quantization/fp8.py | Adds autotune plumbing and bypasses some default heuristics when autotune is enabled. |
| modelopt/onnx/quantization/autotune/workflows.py | Allows ModelProto input, optional output_dir, and adds optional output-dir cleanup. |
| modelopt/onnx/quantization/__main__.py | Adds CLI flag --autotune. |
| modelopt/onnx/op_types.py | Adds get_activation_ops() used by autotune integration logic. |
modelopt-bot
left a comment
Review completed. I've posted several inline comments on specific lines. Overall this is a well-structured PR that successfully integrates Auto Q/DQ into the ONNX quantization workflow. Key highlights include good integration via _find_nodes_to_quantize_autotune, flexible API changes for in-memory models, and an important bug fix for shared Q/DQ pair handling. Please address the inline comments regarding documentation completion, code organization suggestions, and the copyright year consistency. Recommend approving with minor changes.
Force-pushed from ab4c5a3 to c23208f.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #951      +/-   ##
==========================================
- Coverage   70.25%   70.11%   -0.14%
==========================================
  Files         220      221       +1
  Lines       25368    25459      +91
==========================================
+ Hits        17822    17851      +29
- Misses       7546     7608      +62
```

☔ View full report in Codecov by Sentry.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 8 comments.
Force-pushed from c147979 to 0a32bea.
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…eded Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
… silent corruption of the graph. Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
…ize overwrite and other flags (should have the same behavior as pre-autotune) Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
Force-pushed from 1be0cf2 to de2ef53.
### What does this PR do?

**Type of change**: documentation

**Overview**: This PR updates the documentation and does some folder re-structuring and file re-naming related to #951.

### Usage
Documentation

### Testing
Documentation

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅
- Did you write any new necessary tests?: N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ (renamed `AutoQDQ` to `Autotune`)

## Summary by CodeRabbit

* **Documentation**
  * Renamed AutoQDQ to Autotune across guides and changelog.
  * Updated Autotune guide descriptions and wording.
  * Added a new section on optimizing Q/DQ node placement with Autotune, including CLI usage and API links (appears twice in one README).
  * Applied minor grammar and capitalization corrections.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
### What does this PR do?

Type of change: Bug fix

Regression bug introduced by the Autotune integration into ModelOpt ONNX quantization (#951), making ModelOpt dependent on TensorRT in all scenarios. This PR fixes this issue by requiring TensorRT only when `--autotune` is enabled.

### Usage

```shell
$ python -m modelopt.onnx.quantization --onnx_path=${MODEL_NAME}.onnx
```

### Testing

See bug 6056809.

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`). Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

## Summary by CodeRabbit

* **Bug Fixes**
  * Autotune dependency failures now surface as clearer runtime errors instead of only logging warnings.
* **Chores**
  * Centralized autotune presets and numeric defaults into a shared configuration.
  * Core autotune components are conditionally exposed so initialization succeeds when optional acceleration libraries are absent.
  * Deferred autotune imports to runtime to improve failure handling.
* **Tests**
  * Added a test ensuring the quantization CLI/parser initializes correctly without optional acceleration libraries.

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
What does this PR do?
Type of change: New feature
Overview: ONNX Autotune (also called Auto Q/DQ) is currently a standalone feature of ModelOpt that automatically adds Q/DQ nodes where relevant according to information obtained from TensorRT inference. One issue is that the scales in those Q/DQ nodes are random.
This PR does 2 major things:
Usage
Testing
Added unit test for Q/DQ node placement validation: `tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`

Verified that accuracy was recovered by integrating MOQ with Autotune. Results on RTX 3090 with TRT 10.12.0.36 (`--stronglyTyped`) with ViT, as per `examples/onnx_ptq`:

Notice that accuracy was mostly recovered from standalone Autotune to MOQ + Autotune (real Q/DQ scales). The drop in accuracy between MOQ and MOQ + Autotune is likely due to some sensitive nodes being quantized, such as `BiasAdd` (see bug 5916898).

Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Additional information
To reproduce accuracy with ViT, call `download_example_onnx.py` and `image_prep.py` without `--fp16`.

If `--fp16` is used here, quantizing this model with `--autotune` results in an error; this is fixed in #978.