[WIP][Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015
Open
Hale423 wants to merge 5 commits into NVIDIA:main from dev-wahao-autotune-subgraph-profile
Pull Request: Expand ONNX Q/DQ Autotuning with Subgraph Mode
Branch: `dev-wahao-autotune-subgraph-profile` → `main`
Type: Feature
Summary
This PR expands on the existing ONNX Q/DQ Autotune framework by adding a subgraph workflow (`--workflow subgraph`) that uses TensorRT fusion boundaries for faster, fusion-aware QDQ placement optimization.
It also integrates @willg-nv's `TorchRegionBuilder` (from Draft PR #963) for PyTorch-style hierarchical region discovery, and adds TensorRT Python API benchmarking utilities, documentation, and examples.
Relation to existing PRs:
- `TorchRegionBuilder` is included here along with the subgraph workflow.
What's New
- `--workflow subgraph`: fusion-aware grouping from TensorRT `graph.json`; per-subgraph QDQ scheme profiling; optional per-layer timing with fallback to total latency.
- `fusion_grouping.py`: parse TRT `graph.json`, build fusion groups, infer shapes. Auto-generates `graph.json` via a `trtexec` FP16 build if not provided.
- `subgraph_extractor.py`: extract standalone ONNX subgraphs per fusion group for isolated benchmarking.
- `torch_region_builder.py`: PyTorch-style hierarchical region discovery using node name conventions (from #963).
- `tensorrt_utils.py`: TRT Python API benchmark with timing cache, plugin support, and configurable warmup/timing runs.
- `optimized_raw.onnx` + `optimized_final.onnx`.
- `autotune_cache.json` for Phase 2 (subgraph profiling) and Phase 3 (incremental validation).
- `--exportProfile`/`--profilingVerbosity` and retry with total latency.
- `--workflow {region,subgraph}`, `--graph_json`, `--incremental_validation`/`--no-incremental-validation`.
- `examples/qdq_placement/`: README (Quick Start, region vs subgraph, best practices) and `set_batch_size.py`.
Key Files
- `modelopt/onnx/quantization/autotune/__main__.py`: `--workflow`, `--graph-json`, `--incremental-validation`, `--use-trtexec`, `--trtexec-args`, etc.
- `modelopt/onnx/quantization/autotune/subgraph_workflow.py`
- `modelopt/onnx/quantization/autotune/fusion_grouping.py`: parse `graph.json`, create fusion groups, `generate_graph_json()` (trtexec FP16 build when no graph is provided).
- `modelopt/onnx/quantization/autotune/subgraph_extractor.py`
- `modelopt/onnx/quantization/autotune/tensorrt_utils.py`
- `modelopt/onnx/quantization/autotune/torch_region_builder.py`
- `modelopt/onnx/quantization/autotune/benchmark.py`: `export_profile_path`, profiling-flag dedup and "Unknown option" retry.
- `modelopt/onnx/quantization/autotune/workflows.py`: `benchmark_onnx_model()`; passes through `export_profile_path` when using trtexec.
- `modelopt/onnx/quantization/autotune/qdq_utils.py`
- `examples/qdq_placement/README.md`
- `examples/qdq_placement/set_batch_size.py`
- `tests/unit/onnx/quantization/autotune/test_config.py`
How to Test
Region mode (default):
Subgraph mode with trtexec (FP8, optional graph.json):
Resume: Kill the subgraph run midway, then re-run the same command; it should resume from `autotune_cache.json`.
Checklist
- `main` with DCO sign-off
- `--use-trtexec` verified
- `examples/qdq_placement/README.md` matches behavior
Documentation
- `examples/qdq_placement/README.md`: Quick Start, subgraph best practices, output layout, optional graph generation.
- `docs/source/guides/9_qdq_placement.rst` and `docs/source/reference/2_qdq_placement.rst` align with the CLI and behavior above.
Notes
- If `trtexec` reports "Unknown option" for `--exportProfile`/`--profilingVerbosity`, the run is retried without those flags and total latency is used for scheme selection.
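The fallback described in the note on `--exportProfile`/`--profilingVerbosity` can be sketched roughly as below. This is a minimal illustration, not the code in `benchmark.py`: the function names, the injected `runner` callable, and the returned `per_layer_timing` flag are all hypothetical, and the sketch assumes the profiling flags are passed in `--flag=value` form and that an unsupported flag shows up as the text "Unknown option" in the command output.

```python
from typing import Callable, List, Sequence, Tuple

# trtexec profiling flags named in this PR; assumed to be passed as --flag=value.
PROFILING_FLAGS = ("--exportProfile", "--profilingVerbosity")

# Hypothetical runner type: takes argv, returns (return code, combined output text).
Runner = Callable[[List[str]], Tuple[int, str]]


def strip_profiling_flags(argv: Sequence[str]) -> List[str]:
    """Return argv with every profiling flag removed."""
    return [arg for arg in argv if not arg.startswith(PROFILING_FLAGS)]


def run_with_profiling_fallback(argv: Sequence[str], runner: Runner) -> Tuple[str, bool]:
    """Run once as given; on an "Unknown option" failure, retry without profiling flags.

    Returns (output, per_layer_timing). per_layer_timing is False when the retry
    path was taken, meaning only total latency is available for scheme selection.
    """
    returncode, output = runner(list(argv))
    per_layer_timing = True
    if returncode != 0 and "Unknown option" in output:
        # Assumed failure signature: drop the profiling flags and run again.
        returncode, output = runner(strip_profiling_flags(argv))
        per_layer_timing = False
    if returncode != 0:
        raise RuntimeError(f"benchmark command failed:\n{output}")
    return output, per_layer_timing
```

Passing the runner in keeps the retry logic testable without a TensorRT install; a real runner could wrap `subprocess.run(argv, capture_output=True, text=True)` and return `(proc.returncode, proc.stdout + proc.stderr)`.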