
[5951713] Fix benchmark allocation failure#978

Merged
gcunhase merged 2 commits into NVIDIA:main from willg-nv:dev-willg-fix-benchmark-bug
Mar 6, 2026

Conversation

@willg-nv
Contributor

@willg-nv willg-nv commented Mar 5, 2026

What does this PR do?

Fixes a benchmark failure when allocating pinned host memory for float16 tensors:

[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type
Traceback (most recent call last):
...
    raise NotImplementedError(
NotImplementedError: Converting dtype('float16') to a ctypes type
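The failure can be reproduced with NumPy alone, assuming (as the traceback suggests) that the allocator relied on `np.ctypeslib` to map the dtype to a ctypes type:

```python
# Minimal repro sketch: NumPy has no ctypes equivalent for float16,
# so as_ctypes_type raises the NotImplementedError seen in the log.
import numpy as np

try:
    np.ctypeslib.as_ctypes_type(np.dtype("float16"))
except NotImplementedError as exc:
    print(exc)  # Converting dtype('float16') to a ctypes type
```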

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, using torch.load(..., weights_only=True), avoiding pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other source, did you follow IP policy in CONTRIBUTING.md?: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved dtype handling robustness in host memory allocation to avoid failures for uncommon numeric types.
    • Added fallback support for 2-byte floating-point formats (float16, bfloat16); clearer errors now raised when a dtype is unsupported.

Signed-off-by: Will Guo <willg@nvidia.com>
@willg-nv willg-nv requested a review from a team as a code owner March 5, 2026 02:58
@willg-nv willg-nv requested a review from cjluo-nv March 5, 2026 02:58
@copy-pr-bot

copy-pr-bot Bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented Mar 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3287b5fa-98e6-430e-a6d8-68244fbe7a7d

📥 Commits

Reviewing files that changed from the base of the PR and between 06ce30b and 7345b99.

📒 Files selected for processing (1)
  • modelopt/onnx/quantization/autotune/benchmark.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt/onnx/quantization/autotune/benchmark.py

📝 Walkthrough

Walkthrough

Normalize the dtype to np.dtype in _alloc_pinned_host, then attempt a ctypes-based view of the allocated host memory. When the ctypes mapping is missing, fall back to allocating the buffer as uint16 and viewing it as the requested 2-byte dtype; raise TypeError if neither is possible. The return path is unchanged on success.

Changes

Cohort / File(s) | Summary
Dtype / memory-allocation logic (modelopt/onnx/quantization/autotune/benchmark.py): Explicitly convert dtype to np.dtype before computing itemsize; try to create a ctypes-based view of the allocated host memory; if no ctypes mapping exists for 2-byte float types, allocate as uint16 and view as the requested dtype; raise TypeError when unsupported. Highlights: ctypes fallback for 2-byte types; success/error return behavior unchanged.
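A minimal sketch of the described fallback, using a plain bytearray to stand in for the cudaMallocHost allocation (the function name and buffer handling are illustrative, not the repository's actual code):

```python
import ctypes
import numpy as np

def view_host_buffer(buf, shape, dtype):
    """Hypothetical sketch of the _alloc_pinned_host fallback.

    `buf` stands in for the pinned host allocation; the real code views
    memory returned by cudaMallocHost instead of a bytearray.
    """
    dtype = np.dtype(dtype)  # normalize before using .itemsize
    count = int(np.prod(shape))
    try:
        # Dtypes with a ctypes equivalent can be wrapped directly.
        ctype = np.ctypeslib.as_ctypes_type(dtype)
        carr = (ctype * count).from_buffer(buf)
        return np.ctypeslib.as_array(carr).reshape(shape)
    except NotImplementedError:
        # float16/bfloat16 have no ctypes mapping: wrap the memory as
        # uint16 (same 2-byte itemsize), then reinterpret as the target.
        if dtype.itemsize == 2:
            carr = (ctypes.c_uint16 * count).from_buffer(buf)
            return np.ctypeslib.as_array(carr).view(dtype).reshape(shape)
        raise TypeError(f"Unsupported dtype for host allocation: {dtype}")
```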

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
  • Description Check (Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (Passed): The title clearly identifies the specific issue being fixed (benchmark allocation failure) and matches the core change in the changeset (fixing NotImplementedError in memory allocation).
  • Docstring Coverage (Passed): Docstring coverage is 100.00%, which meets the required threshold of 80.00%.
  • Security Anti-Patterns (Passed): The pull request introduces no new security anti-patterns. Modifications to benchmark.py are limited to the _alloc_pinned_host function, adding dtype normalization and a ctypes-based fallback for 2-byte float types, with no involvement of critical security patterns.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

@willg-nv
Contributor Author

willg-nv commented Mar 5, 2026

@gcunhase please help review this PR.

@gcunhase gcunhase changed the title from "Fix benchmark allocation failure" to "[5951713] Fix benchmark allocation failure" Mar 5, 2026
Comment thread modelopt/onnx/quantization/autotune/benchmark.py Outdated
@gcunhase
Contributor

gcunhase commented Mar 5, 2026

/ok to test 06ce30b

@codecov

codecov Bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 0% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.08%. Comparing base (a34d613) to head (7345b99).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/onnx/quantization/autotune/benchmark.py 0.00% 10 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #978      +/-   ##
==========================================
- Coverage   72.10%   72.08%   -0.02%     
==========================================
  Files         209      209              
  Lines       23628    23638      +10     
==========================================
+ Hits        17036    17040       +4     
- Misses       6592     6598       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Will Guo <willg@nvidia.com>
@gcunhase
Contributor

gcunhase commented Mar 6, 2026

/ok to test 7345b99

@gcunhase gcunhase enabled auto-merge (squash) March 6, 2026 19:40
@gcunhase gcunhase merged commit be6dfad into NVIDIA:main Mar 6, 2026
40 checks passed
gcunhase added a commit that referenced this pull request Mar 9, 2026
(#998)

### What does this PR do?

**Type of change**: Bug fix

**Overview**: Replace CUDA memory management from CUDART to PyTorch
(higher-level API).

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing
1. Added unittests.
2. Tested that this PR does not break
#951 or
#978

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)
and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best
Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors)
(e.g. avoiding hardcoded `trust_remote_code=True`, using
`torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).

- Is this change backward compatible?: ✅
- If you copied code from any other source, did you follow IP policy in
[CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?:
N/A <!--- Mandatory -->
- Did you write any new necessary tests?: ✅ <!--- Mandatory for new
features or examples. -->
- Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?:
N/A <!--- Only for new features, API changes, critical bug fixes or
backward incompatible changes. -->

### Additional Information

Summary of changes in `benchmark.py — TensorRTPyBenchmark`:
| What changed | Before | After |
|---|---|---|
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | checks `CUDART_AVAILABLE or cudart is None` | checks `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()` — PyTorch GC handles deallocation |
| `_allocate_buffers` | raw `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` | same cudart pattern | same torch pattern |
| `run` — stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run` — stream arg to TRT | raw integer handle | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | PyTorch raises `RuntimeError`, caught by existing `except Exception` |
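The allocation row of the mapping above can be sketched as follows; the helper name and the availability guard are illustrative, not the repository's actual code, and only a standard PyTorch install is assumed:

```python
# Sketch of the torch-based replacement for _alloc_pinned_host.
import numpy as np

try:
    import torch
    TORCH_CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:  # keep the sketch importable without torch
    torch = None
    TORCH_CUDA_AVAILABLE = False

def alloc_pinned_host(shape, torch_dtype):
    """Return (tensor, numpy_view).

    The tensor owns the memory, so dropping both references lets
    PyTorch's GC free it -- no explicit cudaFreeHost is needed.
    """
    t = torch.empty(shape, dtype=torch_dtype)
    if TORCH_CUDA_AVAILABLE:
        # Page-locked memory enables async H2D copies via
        # tensor.copy_(..., non_blocking=True) on a CUDA stream.
        t = t.pin_memory()
    return t, t.numpy()
```

Note that this path handles float16 for free: torch.empty accepts torch.float16 directly, so no ctypes mapping (and no uint16 fallback) is involved.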

Related to #961

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Refactor**
* TensorRT benchmarking migrated from direct CUDA runtime calls to
PyTorch CUDA tensors, pinned memory, and CUDA stream primitives —
simplifying buffer management, transfers, and timing semantics.
* **Tests**
* Expanded GPU autotune benchmark tests with broader unit and
integration coverage for CUDA/TensorRT paths, pinned-host/device
buffering, stream behavior, warmup/timing, and end-to-end latency
scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
gcunhase added a commit that referenced this pull request Mar 11, 2026
## What does this PR do?

**Type of change:** New feature

**Overview:** ONNX Autotune (also called Auto Q/DQ) is currently a standalone feature of ModelOpt that automatically adds Q/DQ nodes where relevant, based on information obtained from TensorRT inference. One issue is that the scales in those Q/DQ nodes are random.
This PR does 2 major things:
1. Integrates Auto Q/DQ into the ONNX quantization workflow; and
2. Enables calibration data to be used to obtain the correct scales for
the Q/DQ nodes.

## Usage

```shell
python -m modelopt.onnx.quantization --onnx_path=model.onnx --autotune={quick,default,extensive}
```
> Please see `__main__.py` for other args.

## Testing
1. Added unittest for Q/DQ node placement validation:
`tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`

2. Verified that accuracy was recovered by integrating MOQ with
Autotune. Results on RTX 3090 with TRT 10.12.0.36 (`--stronglyTyped`)
with ViT, as per `examples/onnx_ptq`:

| Model | Top-1 acc | Top-5 acc |
|---|---|---|
| FP32 | 85.1% | 97.5% |
| FP16 (FP32 with --fp16) | 85.1% | 97.5% |
| Quant (MOQ) | 82.4% | 96.4% |
| Quant (Autotune) | 0.1% | 0.5% |
| Quant (MOQ + Autotune) | 79.6% | 95.0% |

Notice that accuracy was mostly recovered from standalone Autotune to
MOQ + Autotune (real Q/DQ scales). The drop in accuracy between MOQ and
MOQ + Autotune is likely due to some sensitive nodes being quantized,
such as `BiasAdd` (see bug 5916898).

## Before your PR is "*Ready for review*"
<!-- If you haven't finished some of the above items you can still open
`Draft` PR. -->

- **Make sure you read and follow [Contributor
guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)**
and your commits are signed.
- **Is this change backward compatible?**: Yes <!--- If No, explain why.
-->
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No (will be
done in a different PR)
- **Did you update
[Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**:
No <!--- Only for new features, API changes, critical bug fixes or bw
breaking changes. -->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Autotuning added to ONNX quantization: CLI flags, presets, per-region
tuning, and FP8/INT8 support; accepts in-memory models and optional
output dirs; node-filter loading and explicit-flag CLI behavior.
* Activation-operation accessor exposed and autotune helpers added to
the package API.

* **Bug Fixes**
* Safer graph rewiring to avoid corrupting quantized graphs when targets
are absent.

* **Tests**
* New integration test and model helper validating autotune quantization
consistency.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

## Additional information
To reproduce accuracy with ViT, call `download_example_onnx.py` and
`image_prep.py` without `--fp16`.

If `--fp16` is used here, quantizing this model with `--autotune`
results in the following error:
```
[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type
```
This is fixed in #978.

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
