
[WIP][Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode#1015

Open
Hale423 wants to merge 5 commits into NVIDIA:main from Hale423:dev-wahao-autotune-subgraph-profile

Conversation


@Hale423 Hale423 commented Mar 10, 2026

Pull Request: Expand ONNX Q/DQ Autotuning with Subgraph Mode

Branch: dev-wahao-autotune-subgraph-profile → main
Type: Feature


Summary

This PR extends the existing ONNX Q/DQ Autotune framework with a subgraph workflow (--workflow subgraph) that uses TensorRT fusion boundaries for faster, fusion-aware QDQ placement optimization.

It also integrates @willg-nv's TorchRegionBuilder (from Draft PR #963) for PyTorch-style hierarchical region discovery, and adds TensorRT Python API benchmarking utilities, documentation, and examples.

Relation to existing PRs:

  • Incorporates @willg-nv's TorchRegionBuilder from Draft PR #963 for region discovery.

What's New

| Area | Description |
| --- | --- |
| Subgraph workflow | --workflow subgraph: fusion-aware grouping from TensorRT graph.json; per-subgraph QDQ scheme profiling; optional per-layer timing with fallback to total latency. |
| Fusion grouping | fusion_grouping.py: parses TRT graph.json, builds fusion groups, infers shapes. Auto-generates graph.json via a trtexec FP16 build if not provided. |
| Subgraph extraction | subgraph_extractor.py: extracts standalone ONNX subgraphs per fusion group for isolated benchmarking. |
| Torch region builder | torch_region_builder.py: PyTorch-style hierarchical region discovery using node name conventions (from #963). |
| TensorRT utils | tensorrt_utils.py: TRT Python API benchmark with timing cache, plugin support, and configurable warmup/timing runs. |
| Incremental validation | Per-group full-model validation: applies QDQ groups one by one, keeping only those that improve latency. Saves optimized_raw.onnx + optimized_final.onnx. |
| Cache / resume | autotune_cache.json for Phase 2 (subgraph profiling) and Phase 3 (incremental validation). |
| trtexec compatibility | Profiling-flag retry: on "Unknown option", strips --exportProfile/--profilingVerbosity and retries with total latency. |
| CLI | --workflow {region,subgraph}, --graph-json, --incremental-validation / --no-incremental-validation. |
| Example | examples/qdq_placement/: README (Quick Start, region vs. subgraph, best practices) and set_batch_size.py. |
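The incremental-validation step above is a greedy accept/reject loop: each candidate QDQ group is applied to the full model and kept only if end-to-end latency improves. A minimal sketch of that idea (the `benchmark` and `apply_group` callables are hypothetical stand-ins, not the actual modelopt API):

```python
def incremental_validation(model, groups, benchmark, apply_group):
    """Greedy loop: apply each candidate QDQ group to the full model and
    keep it only if measured latency improves. `benchmark` and
    `apply_group` are hypothetical stand-ins for the real calls."""
    best_latency = benchmark(model)
    accepted = []
    for group in groups:
        candidate = apply_group(model, group)  # model with this group's Q/DQ inserted
        latency = benchmark(candidate)
        if latency < best_latency:             # keep only strict improvements
            model, best_latency = candidate, latency
            accepted.append(group)
    return model, accepted, best_latency
```

Because each group is validated against the current best model rather than the original, a group that only helps in combination with an earlier-accepted group can still be kept.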

Key Files

| Path | Role |
| --- | --- |
| modelopt/onnx/quantization/autotune/__main__.py | CLI: --workflow, --graph-json, --incremental-validation, --use-trtexec, --trtexec-args, etc. |
| modelopt/onnx/quantization/autotune/subgraph_workflow.py | Subgraph pipeline: Phase 1 (fusion grouping), Phase 2 (subgraph profiling), Phase 3 (full-model + incremental validation), cache I/O. |
| modelopt/onnx/quantization/autotune/fusion_grouping.py | Parses graph.json, creates fusion groups; generate_graph_json() runs a trtexec FP16 build when no graph is provided. |
| modelopt/onnx/quantization/autotune/subgraph_extractor.py | Extracts subgraph ONNX from the full model given group inputs/outputs and shapes. |
| modelopt/onnx/quantization/autotune/tensorrt_utils.py | TRT Python API benchmark runner with timing cache, plugin support, and dynamic-shape handling. |
| modelopt/onnx/quantization/autotune/torch_region_builder.py | PyTorch-style hierarchical region discovery for region mode. |
| modelopt/onnx/quantization/autotune/benchmark.py | trtexec benchmark runner: optional export_profile_path, profiling-flag dedup, and "Unknown option" retry. |
| modelopt/onnx/quantization/autotune/workflows.py | Dispatcher and benchmark_onnx_model(); passes through export_profile_path when using trtexec. |
| modelopt/onnx/quantization/autotune/qdq_utils.py | Quantized-tensor discovery helpers. |
| examples/qdq_placement/README.md | User-facing example: prerequisites, Quick Start (region + subgraph), output layout, subgraph best practices. |
| examples/qdq_placement/set_batch_size.py | ResNet50 fixed-batch script for reproducible benchmarking. |
| tests/unit/onnx/quantization/autotune/test_config.py | Config class unit tests. |
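The "Unknown option" retry in benchmark.py can be thought of as: run trtexec once with the profiling flags, and if the build fails with an unknown-option error, strip those flags and rerun, falling back to total latency only. A rough sketch of that fallback pattern (the `run` callable stands in for the real subprocess invocation; only the flag names come from the description above):

```python
PROFILING_FLAGS = ("--exportProfile", "--profilingVerbosity")

def run_with_profiling_fallback(args, run):
    """Try the full trtexec command; on 'Unknown option', retry with the
    profiling flags stripped so only total latency is measured.
    `run(args)` is a hypothetical wrapper returning (returncode, stderr)."""
    code, err = run(args)
    if code != 0 and "Unknown option" in err:
        # Handles the --flag=value form; a real implementation would also
        # drop a separate value argument for the --flag value form.
        stripped = [a for a in args if a.split("=")[0] not in PROFILING_FLAGS]
        code, err = run(stripped)
    return code, err
```

This keeps per-layer profiling on trtexec builds that support it, while degrading gracefully on older builds.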

How to Test

Region mode (default):

cd examples/qdq_placement
curl -L -o resnet50_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet50_Opset17_torch_hub/resnet50_Opset17.onnx
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
python3 -m modelopt.onnx.quantization.autotune --model resnet50.bs128.onnx --output ./resnet50_results --quant-type int8 --schemes-per-region 20
# Expect: ./resnet50_results/optimized_final.onnx and logs under ./resnet50_results/logs/

Subgraph mode with trtexec (FP8, optional graph.json):

python3 -m modelopt.onnx.quantization.autotune \
  --model resnet50.bs128.onnx \
  --output ./resnet50_subgraph \
  --workflow subgraph \
  --quant-type fp8 \
  --use-trtexec \
  --warmup-runs 5 \
  --timing-runs 20 \
  --incremental-validation \
  --trtexec-args "--stronglyTyped" \
  --schemes-per-region 30
# If --graph-json is omitted, first run will trigger trtexec to generate graph.json under output dir.
# Expect: optimized_raw.onnx, optimized_final.onnx, autotune_cache.json, logs/, subgraphs/

Resume: kill the subgraph run midway, then re-run the same command; it should resume from autotune_cache.json.
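The resume behavior depends on the cache file being flushed as profiling progresses, so a killed run can skip already-completed groups on restart. A minimal sketch of that pattern (the cache schema and `profile_one` callable here are illustrative, not the actual autotune_cache.json layout):

```python
import json
import os

def profile_with_resume(groups, profile_one, cache_path="autotune_cache.json"):
    """Re-runnable profiling loop: results are written to disk after each
    group, so an interrupted run resumes by skipping groups already in the
    cache. `profile_one` is a hypothetical per-group benchmark; the cache
    schema is illustrative, not the real autotune_cache.json format."""
    results = {}
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            results = json.load(f)
    for name in groups:
        if name in results:  # already profiled before the interruption
            continue
        results[name] = profile_one(name)
        with open(cache_path, "w") as f:  # flush after every group
            json.dump(results, f)
    return results
```

Writing the cache after every group (rather than once at the end) is what makes the mid-run kill-and-rerun test above cheap to pass.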


Checklist

  • Rebased onto latest main with DCO sign-off
  • All CodeRabbit review comments addressed (license headers, mutable defaults, bug fixes)
  • CI / unit tests pass
  • Region mode end-to-end verified
  • Subgraph mode end-to-end with --use-trtexec verified
  • Interrupted subgraph run resumes after re-run
  • examples/qdq_placement/README.md matches behavior

Documentation

  • Example: examples/qdq_placement/README.md — Quick Start, subgraph best practices, output layout, optional graph generation.
  • Guides / API: docs/source/guides/9_qdq_placement.rst and docs/source/reference/2_qdq_placement.rst align with the CLI and behavior above.

Notes

Summary by CodeRabbit

  • New Features

    • Automated Q/DQ placement optimizer: region and new subgraph workflows, fusion-aware grouping, per-subgraph heuristic schemes, incremental validation, resume/crash recovery, and calibration support.
    • Expanded CLI: workflow selection, calibration options, incremental validation, and profiling/export options.
  • Utilities

    • Batch-size fixer, subgraph extractor, fusion/region inspection, and TensorRT benchmarking (CLI & Python) with timing-cache and profiling fallbacks.
  • Documentation

    • Comprehensive guides, reference docs, examples, and quick-starts for autotuning and deployment.
  • Tests

    • Extensive unit tests for workflows, utilities, and tooling.
