
[5951713] Fix benchmark allocation failure#978

Merged
gcunhase merged 2 commits into NVIDIA:main from willg-nv:dev-willg-fix-benchmark-bug
Mar 6, 2026

Conversation

@willg-nv
Contributor

@willg-nv willg-nv commented Mar 5, 2026

What does this PR do?

Fixes a benchmark failure when allocating pinned host memory for float16 tensors:

[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type
Traceback (most recent call last):
...
    raise NotImplementedError(
NotImplementedError: Converting dtype('float16') to a ctypes type
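The failure can be reproduced with NumPy alone, assuming (as the traceback suggests) that the allocator relied on `np.ctypeslib` to map the dtype to a ctypes type:

```python
# Minimal repro sketch: NumPy has no ctypes equivalent for float16,
# so as_ctypes_type raises the NotImplementedError seen in the log.
import numpy as np

try:
    np.ctypeslib.as_ctypes_type(np.dtype("float16"))
except NotImplementedError as exc:
    print(exc)  # Converting dtype('float16') to a ctypes type
```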

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved dtype handling robustness in host memory allocation to avoid failures for uncommon numeric types.
    • Added fallback support for 2-byte floating-point formats (float16, bfloat16); clearer errors now raised when a dtype is unsupported.

Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv requested a review from a team as a code owner March 5, 2026 02:58
@willg-nv willg-nv requested a review from cjluo-nv March 5, 2026 02:58
@copy-pr-bot

copy-pr-bot Bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3287b5fa-98e6-430e-a6d8-68244fbe7a7d

📥 Commits

Reviewing files that changed from the base of the PR and between 06ce30b and 7345b99.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/autotune/benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/onnx/quantization/autotune/benchmark.py

📝 Walkthrough

Walkthrough

Normalize the dtype to np.dtype in _alloc_pinned_host, then attempt a ctypes-based view of the allocated host memory. When the ctypes mapping is missing, fall back to allocating the buffer as uint16 and viewing it as the requested 2-byte dtype; raise TypeError if neither is possible. The return path is unchanged on success.

Changes

Cohort / File(s) | Summary
Dtype / memory-allocation logic (modelopt/onnx/quantization/autotune/benchmark.py): Explicitly convert dtype to np.dtype before computing itemsize; try to create a ctypes-based view of the allocated host memory; if no ctypes mapping exists for 2-byte float types, allocate as uint16 and view as the requested dtype; raise TypeError when unsupported. Highlights: ctypes fallback for 2-byte types; success/error return behavior unchanged.
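A minimal sketch of the described fallback, using a plain bytearray to stand in for the cudaMallocHost allocation (the function name and buffer handling are illustrative, not the repository's actual code):

```python
import ctypes
import numpy as np

def view_host_buffer(buf, shape, dtype):
    """Hypothetical sketch of the _alloc_pinned_host fallback.

    `buf` stands in for the pinned host allocation; the real code views
    memory returned by cudaMallocHost instead of a bytearray.
    """
    dtype = np.dtype(dtype)  # normalize before using .itemsize
    count = int(np.prod(shape))
    try:
        # Dtypes with a ctypes equivalent can be wrapped directly.
        ctype = np.ctypeslib.as_ctypes_type(dtype)
        carr = (ctype * count).from_buffer(buf)
        return np.ctypeslib.as_array(carr).reshape(shape)
    except NotImplementedError:
        # float16/bfloat16 have no ctypes mapping: wrap the memory as
        # uint16 (same 2-byte itemsize), then reinterpret as the target.
        if dtype.itemsize == 2:
            carr = (ctypes.c_uint16 * count).from_buffer(buf)
            return np.ctypeslib.as_array(carr).view(dtype).reshape(shape)
        raise TypeError(f"Unsupported dtype for host allocation: {dtype}")
```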

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check (Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (Passed): The title clearly identifies the specific issue being fixed (benchmark allocation failure) and matches the core change in the changeset (fixing NotImplementedError in memory allocation).
  • Docstring Coverage (Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Security Anti-Patterns (Passed): The pull request introduces no new security anti-patterns. Modifications to benchmark.py are limited to the _alloc_pinned_host function, adding dtype normalization and a ctypes-based fallback for 2-byte float types, with no involvement of critical security patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@willg-nv
Contributor Author

willg-nv commented Mar 5, 2026

@gcunhase please help review this PR.

@gcunhase gcunhase changed the title from "Fix benchmark allocation failure" to "[5951713] Fix benchmark allocation failure" Mar 5, 2026
Comment thread modelopt/onnx/quantization/autotune/benchmark.py Outdated
@gcunhase
Contributor

gcunhase commented Mar 5, 2026

/ok to test 06ce30b

@codecov

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.08%. Comparing base (a34d613) to head (7345b99).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/onnx/quantization/autotune/benchmark.py 0.00% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #978      +/-   ##
==========================================
- Coverage   72.10%   72.08%   -0.02%     
==========================================
  Files         209      209              
  Lines       23628    23638      +10     
==========================================
+ Hits        17036    17040       +4     
- Misses       6592     6598       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Will Guo <willg@nvidia.com>
@gcunhase
Contributor

gcunhase commented Mar 6, 2026

/ok to test 7345b99

@gcunhase gcunhase enabled auto-merge (squash) March 6, 2026 19:40
@gcunhase gcunhase merged commit be6dfad into NVIDIA:main Mar 6, 2026
40 checks passed
gcunhase added a commit that referenced this pull request Mar 9, 2026
(#998)

### What does this PR do?

**Type of change**: Bug fix

**Overview**: Replace CUDA memory management from CUDART to PyTorch
(higher-level API).

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing
1. Added unittests.
2. Tested that this PR does not break
#951 or
#978

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, using
`torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other source, did you follow IP policy in
[CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?:
N/A <!--- Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

### Additional Information

Summary of changes in `benchmark.py — TensorRTPyBenchmark`:
| What changed | Before | After |
|---|---|---|
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | checks `CUDART_AVAILABLE or cudart is None` | checks `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()` — PyTorch GC handles deallocation |
| `_allocate_buffers` | raw `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` | same cudart pattern | same torch pattern |
| `run` — stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run` — stream arg to TRT | raw integer handle | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | PyTorch raises `RuntimeError`, caught by existing `except Exception` |
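The allocation row of the mapping above can be sketched as follows; the helper name and the availability guard are illustrative, not the repository's actual code, and only a standard PyTorch install is assumed:

```python
# Sketch of the torch-based replacement for _alloc_pinned_host.
import numpy as np

try:
    import torch
    TORCH_CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    torch = None
    TORCH_CUDA_AVAILABLE = False

def alloc_pinned_host(shape, torch_dtype):
    """Return (tensor, numpy_view).

    The tensor owns the memory, so dropping both references lets
    PyTorch's GC free it -- no explicit cudaFreeHost is needed.
    """
    t = torch.empty(shape, dtype=torch_dtype)
    if TORCH_CUDA_AVAILABLE:
        # Page-locked memory enables async H2D copies via
        # tensor.copy_(..., non_blocking=True) on a CUDA stream.
        t = t.pin_memory()
    return t, t.numpy()
```

Note that this path handles float16 for free: torch.empty accepts torch.float16 directly, so no ctypes mapping (and no uint16 fallback) is involved.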

Related to #961

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* TensorRT benchmarking migrated from direct CUDA runtime calls to
PyTorch CUDA tensors, pinned memory, and CUDA stream primitives —
simplifying buffer management, transfers, and timing semantics.
* **Tests**
* Expanded GPU autotune benchmark tests with broader unit and
integration coverage for CUDA/TensorRT paths, pinned-host/device
buffering, stream behavior, warmup/timing, and end-to-end latency
scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
gcunhase added a commit that referenced this pull request Mar 11, 2026
## What does this PR do?

**Type of change:** New feature

**Overview:** ONNX Autotune (also called Auto Q/DQ) is currently a standalone feature of ModelOpt that automatically adds Q/DQ nodes where relevant, based on information obtained from TensorRT inference. One issue is that the scales in those Q/DQ nodes are random.
This PR does 2 major things:
1. Integrates Auto Q/DQ into the ONNX quantization workflow; and
2. Enables calibration data to be used to obtain the correct scales for
the Q/DQ nodes.

## Usage

```shell
python -m modelopt.onnx.quantization --onnx_path=model.onnx --autotune={quick,default,extensive}
```
> Please see `__main__.py` for other args.

## Testing
1. Added unittest for Q/DQ node placement validation:
`tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`

2. Verified that accuracy was recovered by integrating MOQ with
Autotune. Results on RTX 3090 with TRT 10.12.0.36 (`--stronglyTyped`)
with ViT, as per `examples/onnx_ptq`:

| Model | Top-1 acc | Top-5 acc |
|---|---|---|
| FP32 | 85.1% | 97.5% |
| FP16 (FP32 with --fp16) | 85.1% | 97.5% |
| Quant (MOQ) | 82.4% | 96.4% |
| Quant (Autotune) | 0.1% | 0.5% |
| Quant (MOQ + Autotune) | 79.6% | 95.0% |

Notice that accuracy was mostly recovered from standalone Autotune to
MOQ + Autotune (real Q/DQ scales). The drop in accuracy between MOQ and
MOQ + Autotune is likely due to some sensitive nodes being quantized,
such as `BiasAdd` (see bug 5916898).

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes <!--- If No, explain why.
-->
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No (will be
done in a different PR)
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No <!--- Only for new features, API changes, critical bug fixes or bw
breaking changes. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Autotuning added to ONNX quantization: CLI flags, presets, per-region
tuning, and FP8/INT8 support; accepts in-memory models and optional
output dirs; node-filter loading and explicit-flag CLI behavior.
* Activation-operation accessor exposed and autotune helpers added to
the package API.

* **Bug Fixes**
* Safer graph rewiring to avoid corrupting quantized graphs when targets
are absent.

* **Tests**
* New integration test and model helper validating autotune quantization
consistency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

## Additional information
To reproduce accuracy with ViT, call `download_example_onnx.py` and
`image_prep.py` without `--fp16`.

If `--fp16` is used here, quantizing this model with `--autotune`
results in the following error:
```
[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type
```
This is fixed in #978.

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
