[5951713] Fix benchmark allocation failure #978
Conversation
Signed-off-by: Will Guo <willg@nvidia.com>
**CodeRabbit:** No actionable comments were generated in the recent review. 🎉

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes. Pre-merge checks: 4 passed.
@gcunhase please help review this PR.

/ok to test 06ce30b
**Codecov Report** ❌ Patch coverage is …

Additional details and impacted files:

@@ Coverage Diff @@
## main #978 +/- ##
==========================================
- Coverage 72.10% 72.08% -0.02%
==========================================
Files 209 209
Lines 23628 23638 +10
==========================================
+ Hits 17036 17040 +4
- Misses 6592 6598 +6

☔ View full report in Codecov by Sentry.
Signed-off-by: Will Guo <willg@nvidia.com>
/ok to test 7345b99
#998)

### What does this PR do?

**Type of change**: Bug fix

**Overview**: Migrate CUDA memory management from CUDART to PyTorch (a higher-level API).

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

1. Added unittests.
2. Tested that this PR does not break #951 or #978.

### Before your PR is "*Ready for review*"

- Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, using `torch.load(..., weights_only=True)`, avoiding `pickle`, etc.).
- Is this change backward compatible?: ✅
- If you copied code from any other source, did you follow IP policy in [CONTRIBUTING.md](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md#-copying-code-from-other-sources)?: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A

### Additional Information

Summary of changes in `benchmark.py — TensorRTPyBenchmark`:

| What changed | Before | After |
|---|---|---|
| Imports | `contextlib` + `from cuda.bindings import runtime as cudart` | `import torch` (conditional) |
| Availability flag | `CUDART_AVAILABLE` | `TORCH_CUDA_AVAILABLE = torch.cuda.is_available()` |
| `__init__` guard | checks `CUDART_AVAILABLE or cudart is None` | checks `TORCH_CUDA_AVAILABLE` |
| `_alloc_pinned_host` | `cudaMallocHost` + ctypes address hack, returns `(ptr, arr, err)` | `torch.empty(...).pin_memory()`, returns `(tensor, tensor.numpy())` |
| `_free_buffers` | `cudaFreeHost` + `cudaFree` per buffer | `bufs.clear()` — PyTorch GC handles deallocation |
| `_allocate_buffers` | raw `device_ptr` integers, error-code returns | `torch.empty(..., device="cuda")`, `tensor.data_ptr()` for TRT address |
| `_run_warmup` | `cudaMemcpyAsync` + `cudaStreamSynchronize` | `tensor.copy_(non_blocking=True)` inside `torch.cuda.stream()` |
| `_run_timing` | same cudart pattern | same torch pattern |
| `run` — stream lifecycle | `cudaStreamCreate()` / `cudaStreamDestroy()` | `torch.cuda.Stream()` / `del stream` |
| `run` — stream arg to TRT | raw integer handle | `stream.cuda_stream` (integer property) |
| Error handling | `cudaError_t` return codes | PyTorch raises `RuntimeError`, caught by existing `except Exception` |

A minimal sketch of this torch-side pattern is shown after this description.

Related to #961

## Summary by CodeRabbit

* **Refactor**
  * TensorRT benchmarking migrated from direct CUDA runtime calls to PyTorch CUDA tensors, pinned memory, and CUDA stream primitives — simplifying buffer management, transfers, and timing semantics.
* **Tests**
  * Expanded GPU autotune benchmark tests with broader unit and integration coverage for CUDA/TensorRT paths, pinned-host/device buffering, stream behavior, warmup/timing, and end-to-end latency scenarios.

---------

Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>
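As an illustration of the migration summarized in the table above, here is a minimal, hedged sketch of the torch-based buffer and stream pattern. It is not the actual `benchmark.py` code: the shape, dtype, and the commented `execute_async_v3` call are placeholders.

```python
import torch

assert torch.cuda.is_available()  # mirrors the TORCH_CUDA_AVAILABLE guard

shape, dtype = (1, 3, 224, 224), torch.float16  # placeholder binding shape/dtype

# _alloc_pinned_host equivalent: page-locked host tensor plus a NumPy view of it.
host = torch.empty(shape, dtype=dtype).pin_memory()
host_np = host.numpy()  # fill with input data before the copy below

# _allocate_buffers equivalent: device tensor; data_ptr() is the address TensorRT binds to.
device = torch.empty(shape, dtype=dtype, device="cuda")
trt_binding_address = device.data_ptr()

# _run_warmup / _run_timing equivalent: async H2D copy on an explicit stream.
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    device.copy_(host, non_blocking=True)
    # context.execute_async_v3(stream_handle=stream.cuda_stream) would run here
stream.synchronize()

# _free_buffers equivalent: drop references; PyTorch's caching allocator reclaims the memory.
del host, device, stream
```

The key simplification is that allocation failures surface as ordinary `RuntimeError` exceptions rather than `cudaError_t` return codes, which the existing `except Exception` handling already covers.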
## What does this PR do?
**Type of change:** New feature
**Overview:** ONNX Autotune (also called Auto Q/DQ) is currently a
standalone feature of ModelOpt that automatically adds Q/DQ nodes where
relevant, according to information obtained from TensorRT inference. One
issue is that the scales in those Q/DQ nodes are random.
This PR does 2 major things:
1. Integrates Auto Q/DQ into the ONNX quantization workflow; and
2. Enables calibration data to be used to obtain the correct scales for
the Q/DQ nodes.
## Usage
```bash
$ python -m modelopt.onnx.quantization --onnx_path=model.onnx --autotune={quick,default,extensive}
```
> Please see `__main__.py` for other args.
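For the calibration-data path, here is a rough sketch of how the data could be prepared. The `--calibration_data` flag, the `.npz` key `input`, and the shapes below are assumptions; check `__main__.py` and your model's actual input names.

```python
import numpy as np

# Placeholder: preprocessed images shaped to the model's input (e.g. NCHW float32 for ViT).
calib_batch = np.random.rand(32, 3, 224, 224).astype(np.float32)

# Save the batch keyed by the ONNX input name (assumed to be "input" here) so the
# quantizer can feed it during calibration.
np.savez("calib_data.npz", input=calib_batch)

# Then, for example (flag names are assumptions; see __main__.py):
#   python -m modelopt.onnx.quantization \
#       --onnx_path=model.onnx \
#       --calibration_data=calib_data.npz \
#       --autotune=default
```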
## Testing
1. Added unittest for Q/DQ node placement validation:
`tests/gpu/onnx/quantization/test_autotune_quantization_integration.py`
2. Verified that accuracy was recovered by integrating MOQ with
Autotune. Results on RTX 3090 with TRT 10.12.0.36 (`--stronglyTyped`)
with ViT, as per `examples/onnx_ptq`:
| Model                     | Top-1 acc | Top-5 acc |
|---------------------------|-----------|-----------|
| FP32                      | 85.1%     | 97.5%     |
| FP16 (FP32 with `--fp16`) | 85.1%     | 97.5%     |
| Quant (MOQ)               | 82.4%     | 96.4%     |
| Quant (Autotune)          | 0.1%      | 0.5%      |
| Quant (MOQ + Autotune)    | 79.6%     | 95.0%     |
Notice that most of the accuracy lost by standalone Autotune is recovered
with MOQ + Autotune (i.e., with real Q/DQ scales). The remaining gap
between MOQ and MOQ + Autotune is likely due to some sensitive nodes being
quantized, such as `BiasAdd` (see bug 5916898).
## Before your PR is "*Ready for review*"
- **Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md)** and your commits are signed.
- **Is this change backward compatible?**: Yes
- **Did you write any new necessary tests?**: Yes
- **Did you add or update any necessary documentation?**: No (will be done in a different PR)
- **Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?**: No
## Summary by CodeRabbit
* **New Features**
  * Autotuning added to ONNX quantization: CLI flags, presets, per-region tuning, and FP8/INT8 support; accepts in-memory models and optional output dirs; node-filter loading and explicit-flag CLI behavior.
  * Activation-operation accessor exposed and autotune helpers added to the package API.
* **Bug Fixes**
  * Safer graph rewiring to avoid corrupting quantized graphs when targets are absent.
* **Tests**
  * New integration test and model helper validating autotune quantization consistency.
## Additional information
To reproduce accuracy with ViT, call `download_example_onnx.py` and
`image_prep.py` without `--fp16`.
If `--fp16` is used here, quantizing this model with `--autotune`
results in the following error:
```
[modelopt][onnx] - ERROR - Benchmark failed: Converting dtype('float16') to a ctypes type
```
This is fixed in #978.
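For context, the message comes from NumPy's ctypes interop, which has no ctypes equivalent for `float16`; a minimal reproduction (not taken from the benchmark code):

```python
import numpy as np

# np.ctypeslib has no ctypes mapping for half precision, so a pinned-host
# allocation path that converts the model dtype through ctypes fails for FP16.
try:
    np.ctypeslib.as_ctypes_type(np.dtype("float16"))
except NotImplementedError as err:
    print(err)  # Converting dtype('float16') to a ctypes type
```

This is also why the related change in #998 moves buffer allocation to `torch.empty(...).pin_memory()`, which handles `float16` natively.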
---------
Signed-off-by: gcunhase <4861122+gcunhase@users.noreply.github.com>