Add CUDA plugin EP Python package pipeline (#28299)
Co-authored-by: Copilot <copilot@github.com>
…lwu/fix_cuda_plugin_package_pipeline

# Conflicts:
#	tools/ci_build/github/azure-pipelines/plugin-cuda-pipeline.yml
#	tools/ci_build/github/azure-pipelines/stages/plugin-cuda-packaging-stage.yml
#	tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml
#	tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml
#	tools/ci_build/github/azure-pipelines/templates/set-plugin-build-variables-step.yml
Force-pushed from 009ae2d to 189853a.
Pull request overview
Note: Copilot was unable to run its full agentic suite in this review.
Adds CI and packaging infrastructure to build and publish CUDA Plugin EP Python wheels (mirroring the existing WebGPU plugin EP packaging approach).
Changes:
- Introduces `plugin-ep-cuda/python` sources and a `build_wheel.py` script to produce platform wheels containing the CUDA plugin EP binary.
- Extends Linux/Windows CUDA plugin Azure pipeline stages to build and publish Python wheel artifacts.
- Updates manylinux CUDA Docker build to support a configurable `PIP_INDEX_URL`.
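As a rough illustration of the `PIP_INDEX_URL` plumbing, a build helper might assemble the pip invocation as sketched below. This is a minimal Python sketch under stated assumptions; the function name and argument handling are invented for illustration and are not the actual pipeline code.

```python
import os


def pip_install_args(requirements="requirements.txt"):
    """Build a pip install argument list, honoring an optional PIP_INDEX_URL.

    Hypothetical helper: mirrors how a configurable index URL could be
    forwarded from a Docker build arg into dependency installation.
    """
    args = ["pip", "install", "-r", requirements]
    index_url = os.environ.get("PIP_INDEX_URL")
    if index_url:
        # Only pass --index-url when the build arg was actually provided,
        # so the default PyPI index is used otherwise.
        args += ["--index-url", index_url]
    return args


print(pip_install_args())
```

When the build arg is unset, the sketch falls back to pip's default index rather than passing an empty `--index-url`.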
Reviewed changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda | Adds PIP_INDEX_URL build arg handling for dependency installation. |
| tools/ci_build/github/linux/build_cuda_plugin_python_package.sh | New helper script to build the CUDA plugin EP wheel inside the manylinux CUDA image. |
| tools/ci_build/github/linux/build_cuda_plugin_package.sh | Passes PIP_INDEX_URL through to container builds. |
| tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml | Adds Windows wheel build job and adjusts CUDA SDK acquisition / PATH behavior. |
| tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml | Switches default pool to Ubuntu 24.04 and adds Linux wheel build job. |
| tools/ci_build/github/azure-pipelines/stages/plugin-cuda-packaging-stage.yml | Plumbs python_package_name through the packaging stage template. |
| tools/ci_build/github/azure-pipelines/plugin-cuda-pipeline.yml | Specifies CUDA-family-specific Python distribution names. |
| plugin-ep-cuda/python/setup.py | Produces `py3-none-{platform}` wheels and marks the distribution as non-pure. |
| plugin-ep-cuda/python/requirements-build-wheel.txt | Adds wheel build tool dependencies (incl. auditwheel/patchelf on Linux). |
| plugin-ep-cuda/python/pyproject.toml.in | Defines templated project metadata and package data patterns for binaries. |
| plugin-ep-cuda/python/onnxruntime_ep_cuda/`__init__.py` | Adds helper APIs to locate the plugin library and get EP name(s). |
| plugin-ep-cuda/python/onnxruntime_ep_cuda/README.md | Documents runtime registration usage for the plugin EP. |
| plugin-ep-cuda/python/build_wheel.py | Implements templating + staging + wheel build + auditwheel repair flow. |
| plugin-ep-cuda/python/README.md | Documents how to build wheels via build_wheel.py. |
| plugin-ep-cuda/README.md | Top-level docs for CUDA plugin EP packaging and usage. |
| plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION | Establishes minimum compatible onnxruntime version for the wheel dependency. |
Co-authored-by: Copilot <copilot@github.com>
Force-pushed from f74a62a to 3a5285e.
Pull request overview
Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.
```yaml
scriptType: 'batch'
scriptLocation: 'inlineScript'
inlineScript: |
  set AZCOPY_AUTO_LOGIN_TYPE=AZCLI
```
A lot of this code looks shared with the current py-gpu packaging. Should we move it to a common template so we don't need to touch multiple places when updating?
Sounds good. We can refactor them later in a follow-up PR.
I want to avoid impacting other pipelines in this initial version.
Force-pushed from 0cc3713 to d6d3488.
lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
Suggestions: rename

It follows the webgpu pipeline. We can rename them together later.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description

This change includes C++, C# and Python language projections for the GetHardwareDeviceEpIncompatibilityDetails and related APIs added in #26922.

### Motivation and Context

- GetNumHardwareDevices / GetHardwareDevices — enumerate hardware devices (CPU, GPU, NPU) available on the system
- GetHardwareDeviceEpIncompatibilityDetails — check known incompatibility issues between a device and an execution provider
- DeviceEpIncompatibilityDetails_GetReasonsBitmask / GetNotes / GetErrorCode / Release — access and manage incompatibility detail results
- OrtDeviceEpIncompatibilityReason enum — standard incompatibility reason flags

### Testing

Added Python, C++ and C# tests and verified that they passed.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
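The reasons bitmask returned by DeviceEpIncompatibilityDetails_GetReasonsBitmask lends itself to flag-style decoding. The Python sketch below illustrates that pattern only; the flag names are invented for this example, while the real values live in the `OrtDeviceEpIncompatibilityReason` enum of the ONNX Runtime C API.

```python
from enum import IntFlag


class IncompatibilityReason(IntFlag):
    # Invented flag names for illustration only; the real enum is
    # OrtDeviceEpIncompatibilityReason in the ONNX Runtime C API.
    NONE = 0
    DRIVER_TOO_OLD = 1
    UNSUPPORTED_DEVICE = 2
    MISSING_DEPENDENCY = 4


def decode_reasons(bitmask):
    """Expand a reasons bitmask into the individual flags that are set."""
    return [r for r in IncompatibilityReason
            if r is not IncompatibilityReason.NONE and bitmask & r]


print(decode_reasons(3))
```

A caller would apply the same decoding to the integer bitmask obtained from the incompatibility-details object.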
This pull request adds support for CUPTI-based GPU profiling to the CUDA plugin execution provider (EP) in ONNX Runtime. Profiling is now available in the plugin EP when built with the `onnxruntime_ENABLE_CUDA_PROFILING` CMake flag, enabling detailed GPU activity tracing and integration with ORT's profiling system. The implementation introduces a new `CudaPluginEpProfiler` that bridges between ORT's profiling API and CUPTI, and updates the build system, plugin interface, and documentation accordingly.

**CUDA Plugin Profiling Integration:**

* Added a new `CudaPluginEpProfiler` class (`cuda_profiler_plugin.h/.cc`) that implements the `OrtEpProfilerImpl` interface, delegates to a `CUPTIManager` singleton for GPU activity tracing, and provides callbacks for profiling lifecycle and event correlation. [[1]](diffhunk://#diff-1f42eda0693594c09576d132854290df0f39e439d450c79f50e01f9969d0af2dR1-R43) [[2]](diffhunk://#diff-1dccd750352acaba880066f09b8d8a042d13fae7b3dd5bc103f0ab43685ae2deR1-R148)
* Updated the plugin EP interface in `cuda_ep.h`/`cuda_ep.cc` to conditionally provide a `CreateProfilerImpl` callback when profiling is enabled, wiring up the new profiler implementation. [[1]](diffhunk://#diff-82888350617a2e54bb30b1a11cd2563ecaf2b45ed0baba736674d9156c912b20R95-R99) [[2]](diffhunk://#diff-0890d267a71ca02f4173c2ab226e6c5707fcbbf6bbb5f602fa5d92aa82f42a80R137-R143) [[3]](diffhunk://#diff-0890d267a71ca02f4173c2ab226e6c5707fcbbf6bbb5f602fa5d92aa82f42a80R661-R678)
* Modified the CMake build (`onnxruntime_providers_cuda_plugin.cmake`) to conditionally link against `CUDA::cupti` and define the necessary compile-time flags for profiling support.

**Documentation Updates:**

* Expanded the design documentation (`cuda_plugin_ep_design.md`) to describe the profiling and observability architecture, CUPTI integration, correlation ID flow, event collection, and differences from the in-tree CUDA EP profiler. Build configuration and relevant source files are also documented.

**Miscellaneous:**

* Included the new profiler header in the plugin EP implementation.
* Minor test and import adjustments (e.g., `test_cuda_plugin_ep.py`).

These changes enable the CUDA plugin EP to participate fully in ORT's profiling system, allowing users to observe GPU kernel and memory activity in conjunction with CPU-side events when profiling is enabled.
…gging (#28274)

### Description

Move `owned_session_logger_` declaration before `execution_providers_` in `InferenceSession` so the logger outlives EPs during member destruction. C++ destroys members in reverse declaration order.

Previously:

```
~owned_session_logger_ (L905)  // logger freed
        ↓
~execution_providers_ (L745)   // EP teardown logs via dangling pointer → crash
```

After this change:

```
~execution_providers_ (L750)   // EP teardown logs safely ✅
        ↓
~owned_session_logger_ (L744)  // logger freed, no remaining users
```

### Motivation and Context

Plugin EPs receive an `OrtLogger*` via `OrtEpFactory::CreateEp()`. During session destruction, EP teardown callbacks (e.g., `ReleaseNodeComputeInfos`) may log through this pointer. Because `owned_session_logger_` was declared after `execution_providers_`, the logger was already freed when EPs destructed — a use-after-free that crashes deterministically under VERBOSE logging. Affects all Plugin EPs that log in any teardown path. Reproduced with OpenVINO Plugin EP via `webnn_graph_impl_fuzzer` at VERBOSE level.

Fixes #28234

### Tests Added

Added regression tests in `onnxruntime/test/framework/inference_session_test.cc`:

- **`LoggingOnDestroyExecutionProvider`** — A mock EP that logs via its stored logger pointer in its destructor. If the logger has been freed, this triggers a use-after-free (detected by ASan or as a segfault).
- **`SessionLoggerOutlivesEPsOnDestruction`** — Creates a session with VERBOSE logging and the mock EP, then destroys the session. Verifies that the logger was valid during EP teardown and that the teardown log message was captured.
- **`SessionLoggerOutlivesEPsWithMultipleEPs`** — Same scenario with two mock EPs (distinct type names) to confirm all registered EPs can safely log during teardown.

### Verification

Confirmed the tests are effective regression tests:

| Scenario | Result |
|----------|--------|
| **With fix** (logger declared before EPs) | Both tests pass ✅ |
| **Without fix** (logger declared after EPs, original bug) | `SessionLoggerOutlivesEPsOnDestruction` crashes with **Segmentation fault** (exit code 139) — use-after-free ❌ |

This proves the member declaration order is the critical factor, and the tests will catch any future regression that reorders these members.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
…28273)

### Description

Restores the missing `node_compute_range_.Begin()` call in `KernelScope`'s constructor. The corresponding `End()` in the destructor was kept, so every kernel execution hits:

```
ORT_ENFORCE(is_begin_called_, "End must be called after calling Begin.");
```

### Motivation and Context

PR #26846 refactored `sequential_executor.cc` into RAII scope classes but dropped the `node_compute_range_.Begin()` call during the move. Any build with `--enable_nvtx_profile` fails at runtime on the first kernel execution.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
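The invariant this fix restores, that every `End()` is preceded by a matching `Begin()`, can be sketched as a small guard class. This is a Python analogy of the RAII scope described above; the names follow the PR text, but the classes themselves are invented for illustration.

```python
class NodeComputeRange:
    """Toy analogue of the profiling range whose Begin() call was dropped."""

    def __init__(self):
        self.is_begin_called = False
        self.events = []

    def begin(self):
        self.is_begin_called = True
        self.events.append("begin")

    def end(self):
        # Mirrors the ORT_ENFORCE check described in the PR text.
        if not self.is_begin_called:
            raise RuntimeError("End must be called after calling Begin.")
        self.events.append("end")


class KernelScope:
    """RAII-style scope: Begin on entry, End on exit (context-manager analogy)."""

    def __init__(self, node_compute_range):
        self._range = node_compute_range

    def __enter__(self):
        self._range.begin()  # the call the fix restores in the constructor
        return self

    def __exit__(self, *exc):
        self._range.end()
        return False
```

With the `begin()` call missing from `__enter__`, exiting the scope would raise, which is the failure mode the PR describes.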
## Summary

Periodic upstream sync of Intel's OVEP branch (`ovep_1_26_release`) into ORT main. All changes are scoped to the OpenVINO EP and its tests.

### OpenVINO 2026.0 / 2026.1 support

- Add `V_2026_0` / `V_2026_1` version enums; `capability.cc` default bumped to `V_2026_1`.
- Register FLOAT8E4M3FN / FLOAT8E5M2 initializer types on CPU / GPU / NPU.
- Disable OVEP-level QDQ-stripping on OV ≥ 2026.1 (OV handles it internally).
- Add `ReduceSum` to no-dimension-supported ops.

### KV-cache / stateful CausalLM

- Rename `ReorderKVCache` → `SetReorderKVCacheStatus` across backend interfaces.
- Populate `src_idx` / `dst_idx` in `PreProcessInferRequest` with shape validation; clean state after inference and on `RewindKVCache`.
- `FuseCacheReorder`: `beam_idx` and `src_idx`/`dst_idx` paths are now mutually exclusive; reject models that already carry reorder inputs.
- **Behavior change:** `RewindKVCache(index > 0)` now throws when reorder is enabled (physical KV-cache eviction pass is a TODO).

### NPU / provider options

- Force `disable_dynamic_shapes=true` on NPU unless `enable_causallm` is set.
- Preserve user-supplied `NPU_COMPILATION_MODE_PARAMS`; skip it when importing precompiled blobs.
- Preserve factory-level `device_type` when session options don't override it (fixes NPU mis-selection from Python).
- **Behavior change:** removed the `ORT_OPENVINO_NPU_COMPILER_TYPE` env override — OV's default NPU compiler is used now.

### External initializers / weight sharing

- Drop the 32 MB embed threshold — always externalize when multiple external initializers are in memory.
- `DumpOpenVINOEPModel` rebuilds a self-contained proto when initializer data was stripped.
- `AddExternalWeight` validates re-adds against existing offset/size/location (parity with ABI EP); fix race in device-tensor mapping.
- `ov_bin_manager`: bounds-checked pointer view over mapped weights (fixes read-only blob import).
- `qdq_stripping`: use `std::from_chars` so offsets/lengths > 4 GB parse correctly.

### Perf-count dump

- New `ORT_OPENVINO_PERF_COUNT=<dir>` env var writes per-subgraph CSV (`Layer Name,Status,Layer Type,Real Time (us),Exec Type`), replacing the old stdout-only debug dump. Requires `ov::enable_profiling` on the compiled model; logs a warning and no-ops otherwise.

### Misc

- **API:** `IBackend::Infer` is no longer `const` (needed for perf-dump bookkeeping).
- Filter orphaned graph outputs from OVEP sub-graphs.
- Better error message for "cannot export dynamically compiled model" (points to `reshape_input`).
- Human-readable `ovep_exception::type` strings.
- `ov::shutdown()` on DLL unload.

### Tests

- Add `OVEP_ExtInit_DynamicEmbed_Tests` and `OVEP_ExtInit_EmptyRawData_Tests`; refactor setup into `SetUpTestSuite`.
- Narrow OVEP exclusions in `embed_layer_norm`, `fused_matmul`, `matmul_4bits`, `quantize_linear` (skip only unsupported sub-cases).
- `perftest`: reset outputs per run to support data-dependent output shapes (e.g. NonZero).
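The per-subgraph dump written under `ORT_OPENVINO_PERF_COUNT` uses the CSV column layout quoted above. A small Python sketch of reading such a dump back follows; the parsing helper is illustrative only and is not part of OVEP.

```python
import csv
import io

# Column layout documented for the ORT_OPENVINO_PERF_COUNT CSV dump.
PERF_COUNT_COLUMNS = ["Layer Name", "Status", "Layer Type",
                      "Real Time (us)", "Exec Type"]


def read_perf_counts(text):
    """Parse a perf-count CSV dump into a list of per-layer row dicts."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != PERF_COUNT_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


sample = ("Layer Name,Status,Layer Type,Real Time (us),Exec Type\n"
          "conv1,EXECUTED,Convolution,123,GPU\n")
rows = read_perf_counts(sample)
```

A post-processing script could aggregate `Real Time (us)` per `Layer Type` from these rows to spot hot operators.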
## Testing

Validated against the OpenVINO versions this release targets (2025.3 – 2026.1) on CPU / GPU / NPU:

- New OVEP tests pass: `OVEP_ExtInit_Tests`, `OVEP_ExtInit_DynamicEmbed_Tests`, `OVEP_ExtInit_EmptyRawData_Tests`
- Narrowed contrib-op exclusions verified against EmbedLayerNorm, FusedMatMul, MatMulNBits, QuantizeLinear
- Stateful CausalLM flow exercised for KV-cache reorder + rewind
- `ORT_OPENVINO_PERF_COUNT=<dir>` verified to produce per-subgraph CSVs
- 2+ GB external-initializers-in-memory model loads on CPU / GPU / NPU

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: Kotomi-Du <yaru.du@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Mikhail Dvoretckii <mikhail.dvoretckii@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: ai-fw-intg <sys_ai_fw_intg@intel.com>
Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com>
Co-authored-by: RajeevSekar <117911837+RajeevSekar@users.noreply.github.com>
Co-authored-by: Nazanin Beheshti <nazanin.beheshti@intel.com>
### Description

Fix CUDA 13 build failure introduced by PR #28198 (commit 997c479).

**Root cause:** `gqa_unfused_attention.cu` directly includes `<cub/cub.cuh>`, which on CUDA 13.x transitively pulls in CCCL's `tcgen05_ld.h`. That header uses `__out` as a parameter name in inline PTX assembly, but on Windows MSVC the SAL annotation macro `#define __out` expands it, turning `__out[0]` into `[0]` — causing a parse error.

**Fix:** Changed `#include <cub/cub.cuh>` to `#include "core/providers/cuda/cu_inc/cub.cuh"` — the existing ORT wrapper that `#undef`s `__out` before including CUB.

### Motivation and Context

The CUDA 13 packaging pipeline (`py-cuda13-packaging-pipeline.yml`) has been failing since PR #28198 was merged, with errors like:

```
E:/_work/_temp/v13.0/include/cccl/cuda/__ptx/instructions/generated/tcgen05_ld.h(20): error : expected an identifier
  asm("tcgen05.ld.sync.aligned.16x64b.x1.b32 {%0}, [%1];" : "=r"( [0]) : "r"(__taddr) : "memory");
```
Co-authored-by: Copilot <copilot@github.com>
Pull request overview
Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.
…ith 'import' and 'import from''

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Description
Add Python wheel packaging support for the CUDA plugin EP, following the WebGPU plugin EP packaging pattern from #28226.
Changes include:
- `plugin-ep-cuda/python` packaging sources for the `onnxruntime-ep-cuda` wheel.
- `PluginPythonPackageVersion`

Validation
Notes
The Linux Python package job is limited to x64 for now, matching the existing x64 plugin artifact packaging flow.