Add CUDA plugin EP Python package pipeline #28299

Open
tianleiwu wants to merge 25 commits into main from tlwu/fix_cuda_plugin_package_pipeline

@tianleiwu
Contributor

Description

Add Python wheel packaging support for the CUDA plugin EP, following the WebGPU plugin EP packaging pattern from #28226.

Changes include:

  • Add plugin-ep-cuda/python packaging sources for the onnxruntime-ep-cuda wheel.
  • Add helper APIs to locate/register the CUDA plugin EP shared library.
  • Add Linux and Windows x64 Python package jobs that consume the CUDA plugin binary artifacts.
  • Extend plugin package version setup to emit a PEP 440-compatible PluginPythonPackageVersion.
  • Add a Linux Docker helper script to build the CUDA plugin Python wheel inside the manylinux CUDA image.
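
The PEP 440 requirement on `PluginPythonPackageVersion` can be illustrated with a minimal sketch. The function name `make_package_version`, the version numbers, and the `.devYYYYMMDD` scheme are assumptions for illustration only and are not taken from this PR; the regular expression is the canonical-form pattern given in PEP 440, Appendix B.

```python
import re
from typing import Optional

# Canonical public-version pattern from PEP 440, Appendix B (no local segment).
_PEP440_CANONICAL = re.compile(
    r"^([1-9][0-9]*!)?(0|[1-9][0-9]*)(\.(0|[1-9][0-9]*))*"
    r"((a|b|rc)(0|[1-9][0-9]*))?(\.post(0|[1-9][0-9]*))?"
    r"(\.dev(0|[1-9][0-9]*))?$"
)


def make_package_version(base_version: str, dev_date: Optional[str] = None) -> str:
    """Build a release or dev version string (hypothetical scheme) and validate it."""
    version = base_version if dev_date is None else f"{base_version}.dev{dev_date}"
    if _PEP440_CANONICAL.match(version) is None:
        raise ValueError(f"not a canonical PEP 440 version: {version!r}")
    return version


# Hypothetical inputs; real values would come from the pipeline.
print(make_package_version("1.25.0"))
print(make_package_version("1.25.0", "20260430"))
```

A string such as `1.25.0-dev` is rejected by this check, which is the kind of value a wheel build would otherwise have to normalize or refuse.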

Validation

  • Parsed touched Azure pipeline YAML files with PyYAML.
  • Ran Python syntax checks for the new package helper and wheel builder.

Notes

The Linux Python package job is limited to x64 for now, matching the existing x64 plugin artifact packaging flow.

@tianleiwu tianleiwu marked this pull request as draft April 30, 2026 20:41
…lwu/fix_cuda_plugin_package_pipeline

# Conflicts:
#	tools/ci_build/github/azure-pipelines/plugin-cuda-pipeline.yml
#	tools/ci_build/github/azure-pipelines/stages/plugin-cuda-packaging-stage.yml
#	tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml
#	tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml
#	tools/ci_build/github/azure-pipelines/templates/set-plugin-build-variables-step.yml
Comment thread plugin-ep-cuda/python/build_wheel.py Fixed
Comment thread plugin-ep-cuda/python/build_wheel.py Fixed
Comment thread plugin-ep-cuda/python/build_wheel.py Fixed
Comment thread plugin-ep-cuda/python/onnxruntime_ep_cuda/__init__.py Fixed
Comment thread plugin-ep-cuda/python/onnxruntime_ep_cuda/__init__.py Fixed
Comment thread plugin-ep-cuda/python/onnxruntime_ep_cuda/__init__.py Fixed
Comment thread plugin-ep-cuda/python/setup.py Fixed
Comment thread plugin-ep-cuda/python/setup.py Fixed
Comment thread plugin-ep-cuda/python/setup.py Fixed
@tianleiwu tianleiwu force-pushed the tlwu/fix_cuda_plugin_package_pipeline branch from 009ae2d to 189853a Compare April 30, 2026 22:11
@tianleiwu tianleiwu requested a review from Copilot April 30, 2026 22:17
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds CI and packaging infrastructure to build and publish CUDA Plugin EP Python wheels (mirroring the existing WebGPU plugin EP packaging approach).

Changes:

  • Introduces plugin-ep-cuda/python sources and a build_wheel.py script to produce platform wheels containing the CUDA plugin EP binary.
  • Extends Linux/Windows CUDA plugin Azure pipeline stages to build and publish Python wheel artifacts.
  • Updates manylinux CUDA Docker build to support configurable PIP_INDEX_URL.

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 7 comments.

| File | Description |
|------|-------------|
| `tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda` | Adds PIP_INDEX_URL build arg handling for dependency installation. |
| `tools/ci_build/github/linux/build_cuda_plugin_python_package.sh` | New helper script to build the CUDA plugin EP wheel inside the manylinux CUDA image. |
| `tools/ci_build/github/linux/build_cuda_plugin_package.sh` | Passes PIP_INDEX_URL through to container builds. |
| `tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml` | Adds Windows wheel build job and adjusts CUDA SDK acquisition / PATH behavior. |
| `tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml` | Switches default pool to Ubuntu 24.04 and adds Linux wheel build job. |
| `tools/ci_build/github/azure-pipelines/stages/plugin-cuda-packaging-stage.yml` | Plumbs python_package_name through the packaging stage template. |
| `tools/ci_build/github/azure-pipelines/plugin-cuda-pipeline.yml` | Specifies CUDA-family-specific Python distribution names. |
| `plugin-ep-cuda/python/setup.py` | Produces py3-none-{platform} wheels and marks distribution as non-pure. |
| `plugin-ep-cuda/python/requirements-build-wheel.txt` | Adds wheel build tool dependencies (incl. auditwheel/patchelf on Linux). |
| `plugin-ep-cuda/python/pyproject.toml.in` | Defines templated project metadata and package data patterns for binaries. |
| `plugin-ep-cuda/python/onnxruntime_ep_cuda/__init__.py` | Adds helper APIs to locate the plugin library and get EP name(s). |
| `plugin-ep-cuda/python/onnxruntime_ep_cuda/README.md` | Documents runtime registration usage for the plugin EP. |
| `plugin-ep-cuda/python/build_wheel.py` | Implements templating + staging + wheel build + auditwheel repair flow. |
| `plugin-ep-cuda/python/README.md` | Documents how to build wheels via build_wheel.py. |
| `plugin-ep-cuda/README.md` | Top-level docs for CUDA plugin EP packaging and usage. |
| `plugin-ep-cuda/MIN_ONNXRUNTIME_VERSION` | Establishes minimum compatible onnxruntime version for the wheel dependency. |
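
A helper that locates the plugin library inside an installed package might look roughly like the sketch below. This is not the code in `onnxruntime_ep_cuda/__init__.py`: the function names, the library stem `onnxruntime_providers_cuda_plugin`, and the assumption that the binary sits directly in the package directory are all hypothetical.

```python
import sys
import tempfile
from pathlib import Path

# Hypothetical library stem; the real file name is defined by the plugin build.
HYPOTHETICAL_STEM = "onnxruntime_providers_cuda_plugin"


def shared_library_name(stem: str, platform: str = sys.platform) -> str:
    """Map a library stem to the conventional shared-library file name."""
    if platform.startswith("win"):
        return f"{stem}.dll"
    if platform == "darwin":
        return f"lib{stem}.dylib"
    return f"lib{stem}.so"


def find_plugin_library(package_dir, stem: str = HYPOTHETICAL_STEM,
                        platform: str = sys.platform) -> Path:
    """Return the path of the plugin library inside package_dir, or raise."""
    candidate = Path(package_dir) / shared_library_name(stem, platform)
    if not candidate.is_file():
        raise FileNotFoundError(f"plugin library not found: {candidate}")
    return candidate


if __name__ == "__main__":
    # Demonstrate against a throwaway directory standing in for the package.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / shared_library_name(HYPOTHETICAL_STEM)).touch()
        print(find_plugin_library(tmp).name)
```

In real use the returned path would be handed to ONNX Runtime's library registration call (for example `onnxruntime.register_execution_provider_library`), as the package README is said to document.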

Comment thread tools/ci_build/github/linux/build_cuda_plugin_python_package.sh
Comment thread tools/ci_build/github/linux/build_cuda_plugin_python_package.sh Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/plugin-win-cuda-stage.yml Outdated
Comment thread tools/ci_build/github/azure-pipelines/stages/plugin-linux-cuda-stage.yml Outdated
Co-authored-by: Copilot <copilot@github.com>
@tianleiwu tianleiwu force-pushed the tlwu/fix_cuda_plugin_package_pipeline branch from f74a62a to 3a5285e Compare May 1, 2026 00:05
@tianleiwu tianleiwu requested a review from Copilot May 1, 2026 00:12
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.


Comment thread plugin-ep-cuda/README.md Outdated
Comment thread plugin-ep-cuda/python/build_wheel.py Outdated
Comment thread tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda Outdated
scriptType: 'batch'
scriptLocation: 'inlineScript'
inlineScript: |
set AZCOPY_AUTO_LOGIN_TYPE=AZCLI
Contributor

A lot of this code looks shared with the current py-gpu packaging. Should we move it to a common template so we don't need to touch multiple places when updating?

Contributor Author

@tianleiwu tianleiwu May 1, 2026

Sounds good. We can refactor them later in a follow-up PR.
I want to avoid impacting other pipelines in this initial version.

@tianleiwu tianleiwu marked this pull request as ready for review May 1, 2026 02:20
@tianleiwu tianleiwu force-pushed the tlwu/fix_cuda_plugin_package_pipeline branch from 0cc3713 to d6d3488 Compare May 1, 2026 09:14
Contributor

@github-actions github-actions Bot left a comment

You can commit the suggested changes from lintrunner.

Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Comment thread tools/ci_build/set_plugin_build_variables.py Outdated
Contributor

@github-advanced-security github-advanced-security AI left a comment

lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

tianleiwu and others added 2 commits May 1, 2026 02:42
Co-authored-by: Copilot <copilot@github.com>
@sanaa-hamel-microsoft
Contributor

Suggestion: rename plugin-cuda-pipeline.yml -> packaging-plugin-cuda.yml?

@tianleiwu
Contributor Author

> Suggestion: rename plugin-cuda-pipeline.yml -> packaging-plugin-cuda.yml?

It follows the WebGPU pipeline naming. We can rename them together later.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 20 changed files in this pull request and generated 1 comment.


Comment thread tools/ci_build/github/linux/docker/Dockerfile.manylinux2_28_cuda Outdated
tianleiwu and others added 8 commits May 2, 2026 16:54
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
### Description
This change adds C++, C#, and Python language projections for the
`GetHardwareDeviceEpIncompatibilityDetails` and related APIs added in #26922.

### Motivation and Context
- GetNumHardwareDevices / GetHardwareDevices — enumerate hardware
devices (CPU, GPU, NPU) available on the system
- GetHardwareDeviceEpIncompatibilityDetails — check known
incompatibility issues between a device and an execution
provider
- DeviceEpIncompatibilityDetails_GetReasonsBitmask / GetNotes /
GetErrorCode / Release — access and manage
incompatibility detail results
- OrtDeviceEpIncompatibilityReason enum — standard incompatibility
reason flags

### Testing
Added Python, C++, and C# tests and verified that they pass.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This pull request adds support for CUPTI-based GPU profiling to the CUDA
plugin execution provider (EP) in ONNX Runtime. Profiling is now
available in the plugin EP when built with the
`onnxruntime_ENABLE_CUDA_PROFILING` CMake flag, enabling detailed GPU
activity tracing and integration with ORT's profiling system. The
implementation introduces a new `CudaPluginEpProfiler` that bridges
between ORT's profiling API and CUPTI, and updates the build system,
plugin interface, and documentation accordingly.

**CUDA Plugin Profiling Integration:**

* Added a new `CudaPluginEpProfiler` class
(`cuda_profiler_plugin.h/.cc`) that implements the `OrtEpProfilerImpl`
interface, delegates to a `CUPTIManager` singleton for GPU activity
tracing, and provides callbacks for profiling lifecycle and event
correlation.
* Updated the plugin EP interface in `cuda_ep.h`/`cuda_ep.cc` to
conditionally provide a `CreateProfilerImpl` callback when profiling is
enabled, wiring up the new profiler implementation.
* Modified the CMake build (`onnxruntime_providers_cuda_plugin.cmake`)
to conditionally link against `CUDA::cupti` and define the necessary
compile-time flags for profiling support.

**Documentation Updates:**

* Expanded the design documentation (`cuda_plugin_ep_design.md`) to
describe the profiling and observability architecture, CUPTI
integration, correlation ID flow, event collection, and differences from
the in-tree CUDA EP profiler. Build configuration and relevant source
files are also documented.

**Miscellaneous:**

* Included the new profiler header in the plugin EP implementation.
* Minor test and import adjustments (e.g., `test_cuda_plugin_ep.py`).

These changes enable the CUDA plugin EP to participate fully in ORT's
profiling system, allowing users to observe GPU kernel and memory
activity in conjunction with CPU-side events when profiling is enabled.
…gging (#28274)

### Description

Move `owned_session_logger_` declaration before `execution_providers_`
in `InferenceSession` so the logger outlives EPs during member
destruction.

C++ destroys members in reverse declaration order. Previously:

```
~owned_session_logger_   (L905)  // logger freed
     ↓
~execution_providers_    (L745)  // EP teardown logs via dangling pointer → crash
```

After this change:

```
~execution_providers_    (L750)  // EP teardown logs safely ✅
     ↓
~owned_session_logger_   (L744)  // logger freed, no remaining users
```

### Motivation and Context

Plugin EPs receive an `OrtLogger*` via `OrtEpFactory::CreateEp()`.
During session destruction, EP teardown callbacks (e.g.,
`ReleaseNodeComputeInfos`) may log through this pointer. Because
`owned_session_logger_` was declared after `execution_providers_`, the
logger was already freed when EPs destructed — a use-after-free that
crashes deterministically under VERBOSE logging.

Affects all Plugin EPs that log in any teardown path. Reproduced with
OpenVINO Plugin EP via `webnn_graph_impl_fuzzer` at VERBOSE level.

Fixes #28234

### Tests Added

Added regression tests in
`onnxruntime/test/framework/inference_session_test.cc`:

- **`LoggingOnDestroyExecutionProvider`** — A mock EP that logs via its
stored logger pointer in its destructor. If the logger has been freed,
this triggers a use-after-free (detected by ASan or as a segfault).
- **`SessionLoggerOutlivesEPsOnDestruction`** — Creates a session with
VERBOSE logging and the mock EP, then destroys the session. Verifies
that the logger was valid during EP teardown and that the teardown log
message was captured.
- **`SessionLoggerOutlivesEPsWithMultipleEPs`** — Same scenario with two
mock EPs (distinct type names) to confirm all registered EPs can safely
log during teardown.

### Verification

Confirmed the tests are effective regression tests:

| Scenario | Result |
|----------|--------|
| **With fix** (logger declared before EPs) | Both tests pass ✅ |
| **Without fix** (logger declared after EPs, original bug) | `SessionLoggerOutlivesEPsOnDestruction` crashes with **Segmentation fault** (exit code 139), a use-after-free ❌ |

This proves the member declaration order is the critical factor, and the
tests will catch any future regression that reorders these members.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
…28273)

### Description

Restores the missing `node_compute_range_.Begin()` call in
`KernelScope`'s constructor. The corresponding `End()` in the destructor
was kept, so every kernel execution hits:

```
ORT_ENFORCE(is_begin_called_, "End must be called after calling Begin.");
```

### Motivation and Context

PR #26846 refactored `sequential_executor.cc` into RAII scope classes
but dropped the `node_compute_range_.Begin()` call during the move. Any
build with `--enable_nvtx_profile` fails at runtime on the first kernel
execution.
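
The Begin/End pairing that the RAII scope has to maintain can be sketched with a Python context manager. This is an analogy only, not ORT code: the `ProfilingRange` class and the `call_begin` switch are hypothetical, and only the error message is taken from the `ORT_ENFORCE` line quoted above.

```python
class ProfilingRange:
    """Stand-in for a profiling range that enforces Begin before End."""

    def __init__(self):
        self._begin_called = False

    def begin(self):
        self._begin_called = True

    def end(self):
        # Mirrors the enforce check quoted in the description.
        if not self._begin_called:
            raise RuntimeError("End must be called after calling Begin.")
        self._begin_called = False


class KernelScope:
    """Scope object: opens the range on entry and closes it on exit."""

    def __init__(self, profiling_range, call_begin=True):
        self._range = profiling_range
        self._call_begin = call_begin  # False reproduces the dropped Begin() call

    def __enter__(self):
        if self._call_begin:
            self._range.begin()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._range.end()
        return False


# With begin() on entry the scope closes cleanly.
with KernelScope(ProfilingRange()):
    pass

# Without it, the exit path fails on the very first scope.
try:
    with KernelScope(ProfilingRange(), call_begin=False):
        pass
except RuntimeError as err:
    print(f"scope failed: {err}")
```

Keeping both calls inside the same scope object is what makes the pairing hard to break by accident; the regression came from moving only one half.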

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
## Summary
Periodic upstream sync of Intel's OVEP branch (`ovep_1_26_release`) into
ORT main. All changes are scoped to the OpenVINO EP and its tests.

### OpenVINO 2026.0 / 2026.1 support
- Add `V_2026_0` / `V_2026_1` version enums; `capability.cc` default
bumped to `V_2026_1`.
- Register FLOAT8E4M3FN / FLOAT8E5M2 initializer types on CPU / GPU /
NPU.
- Disable OVEP-level QDQ-stripping on OV ≥ 2026.1 (OV handles it
internally).
- Add `ReduceSum` to no-dimension-supported ops.

### KV-cache / stateful CausalLM
- Rename `ReorderKVCache` → `SetReorderKVCacheStatus` across backend
interfaces.
- Populate `src_idx` / `dst_idx` in `PreProcessInferRequest` with shape
validation; clean state after inference and on `RewindKVCache`.
- `FuseCacheReorder`: `beam_idx` and `src_idx`/`dst_idx` paths are now
mutually exclusive; reject models that already carry reorder inputs.
- **Behavior change:** `RewindKVCache(index > 0)` now throws when
reorder is enabled (physical KV-cache eviction pass is a TODO).

### NPU / provider options
- Force `disable_dynamic_shapes=true` on NPU unless `enable_causallm` is
set.
- Preserve user-supplied `NPU_COMPILATION_MODE_PARAMS`; skip it when
importing precompiled blobs.
- Preserve factory-level `device_type` when session options don't
override it (fixes NPU mis-selection from Python).
- **Behavior change:** removed the `ORT_OPENVINO_NPU_COMPILER_TYPE` env
override — OV's default NPU compiler is used now.

### External initializers / weight sharing
- Drop the 32 MB embed threshold — always externalize when multiple
external initializers are in memory.
- `DumpOpenVINOEPModel` rebuilds a self-contained proto when initializer
data was stripped.
- `AddExternalWeight` validates re-adds against existing
offset/size/location (parity with ABI EP); fix race in device-tensor
mapping.
- `ov_bin_manager`: bounds-checked pointer view over mapped weights
(fixes read-only blob import).
- `qdq_stripping`: use `std::from_chars` so offsets/lengths > 4 GB parse
correctly.

### Perf-count dump
- New `ORT_OPENVINO_PERF_COUNT=<dir>` env var writes per-subgraph CSV
(`Layer Name,Status,Layer Type,Real Time (us),Exec Type`), replacing the
old stdout-only debug dump. Requires `ov::enable_profiling` on the
compiled model; logs a warning and no-ops otherwise.
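
A per-subgraph CSV with the header listed above could be summarized with a short script like the following sketch. Only the column names come from the description; the sample rows, layer names, status strings, and the helper name `total_time_by_layer_type` are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def total_time_by_layer_type(csv_text: str) -> dict:
    """Sum the 'Real Time (us)' column per 'Layer Type'."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Layer Type"]] += float(row["Real Time (us)"])
    return dict(totals)


# Hypothetical sample data; a real file would be read from the dump directory.
SAMPLE = """Layer Name,Status,Layer Type,Real Time (us),Exec Type
conv1,EXECUTED,Convolution,120,jit_avx2
relu1,EXECUTED,ReLU,15,jit_avx2
conv2,EXECUTED,Convolution,80,jit_avx2
"""

for layer_type, micros in sorted(total_time_by_layer_type(SAMPLE).items()):
    print(f"{layer_type}: {micros:.0f} us")
```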

### Misc
- **API:** `IBackend::Infer` is no longer `const` (needed for perf-dump
bookkeeping).
- Filter orphaned graph outputs from OVEP sub-graphs.
- Better error message for "cannot export dynamically compiled model"
(points to `reshape_input`).
- Human-readable `ovep_exception::type` strings.
- `ov::shutdown()` on DLL unload.

### Tests
- Add `OVEP_ExtInit_DynamicEmbed_Tests` and
`OVEP_ExtInit_EmptyRawData_Tests`; refactor setup into `SetUpTestSuite`.
- Narrow OVEP exclusions in `embed_layer_norm`, `fused_matmul`,
`matmul_4bits`, `quantize_linear` (skip only unsupported sub-cases).
- `perftest`: reset outputs per run to support data-dependent output
shapes (e.g. NonZero).

## Testing
Validated against the OpenVINO versions this release targets (2025.3 –
2026.1) on CPU / GPU / NPU:
- New OVEP tests pass: `OVEP_ExtInit_Tests`,
`OVEP_ExtInit_DynamicEmbed_Tests`, `OVEP_ExtInit_EmptyRawData_Tests`
- Narrowed contrib-op exclusions verified against EmbedLayerNorm,
FusedMatMul, MatMulNBits, QuantizeLinear
- Stateful CausalLM flow exercised for KV-cache reorder + rewind
- `ORT_OPENVINO_PERF_COUNT=<dir>` verified to produce per-subgraph CSVs
- 2+ GB external-initializers-in-memory model loads on CPU / GPU / NPU

---------

Signed-off-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Signed-off-by: bfilipek <bartlomiej.filipek@intel.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: Christian Bourjau <christian.bourjau@quantco.com>
Co-authored-by: Ryan Metcalfe <107415876+RyanMetcalfeInt8@users.noreply.github.com>
Co-authored-by: jatinwadhwa921 <110383850+jatinwadhwa921@users.noreply.github.com>
Co-authored-by: Jaswanth Gannamaneni <jaswanth.gannamaneni@intel.com>
Co-authored-by: Klimenko, Mikhail <mikhail.klimenko@intel.com>
Co-authored-by: Vishnudas Thaniel S <vishnudas.thaniel.s@intel.com>
Co-authored-by: n1harika <niharika.sathish@intel.com>
Co-authored-by: TejalKhade28 <tejal.khade@intel.com>
Co-authored-by: Preetha Veeramalai <preetha.veeramalai@intel.com>
Co-authored-by: liang <gxgaoliang@126.com>
Co-authored-by: Javier Martinez <javier.e.martinez@intel.com>
Co-authored-by: MayureshV1 <47039074+MayureshV1@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: sfatimar <sahar.fatima@intel.com>
Co-authored-by: Garth Long <garth.long@intel.com>
Co-authored-by: Eric Crawford <eric.r.crawford@intel.com>
Co-authored-by: derdeljan-msft <derdeljan@microsoft.com>
Co-authored-by: Jonathan Clohessy <jonathan.clohessy@arm.com>
Co-authored-by: Akshay Sonawane <111780983+apsonawane@users.noreply.github.com>
Co-authored-by: Christopher Warrington <chwarr@microsoft.com>
Co-authored-by: Ishwar Raut <iraut@nvidia.com>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Xinpeng Dou <15529241576@163.com>
Co-authored-by: Chi Lo <54722500+chilo-ms@users.noreply.github.com>
Co-authored-by: adrastogi <aditya.rastogi@microsoft.com>
Co-authored-by: Aditya Rastogi <adityar@ntdev.microsoft.com>
Co-authored-by: qti-hungjuiw <hungjuiw@qti.qualcomm.com>
Co-authored-by: qti-yuduo <yuduow@qti.qualcomm.com>
Co-authored-by: Pradeep Sakhamoori <psakhamoori@microsoft.com>
Co-authored-by: Adam Pocock <adam.pocock@oracle.com>
Co-authored-by: Changming Sun <chasun@microsoft.com>
Co-authored-by: mingyue <131847423+mingyueliuh@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
Co-authored-by: Susanta Bhattacharjee <susanta.bhattacharjee@intel.com>
Co-authored-by: jatinwadhwa921 <jatin.wadhwa@intel.com>
Co-authored-by: Jozef Wludzik <jozef.wludzik@intel.com>
Co-authored-by: Bartlomiej Filipek <bartlomiej.filipek@intel.com>
Co-authored-by: Kotomi-Du <yaru.du@intel.com>
Co-authored-by: Rajeev Sekar <rajeevsekar21@gmail.com>
Co-authored-by: Mayuresh M Varerkar <mayuresh.m.varerkar@intel.com>
Co-authored-by: Mikhail Dvoretckii <mikhail.dvoretckii@intel.com>
Co-authored-by: bopeng1234 <bo.peng@intel.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: fs-eire <7679871+fs-eire@users.noreply.github.com>
Co-authored-by: Wenqin Yang <wenqin.yang@intel.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: xieofxie <xieofxie@126.com>
Co-authored-by: hualxie <hualxie@microsoft.com>
Co-authored-by: Jiajia Qin <jiajiaqin@microsoft.com>
Co-authored-by: Joshua Lochner <admin@xenova.com>
Co-authored-by: Christian Bourjau <cbourjau@users.noreply.github.com>
Co-authored-by: Xiaofei Han <xiaofeihan@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: chunghow-qti <chunghow@qti.qualcomm.com>
Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com>
Co-authored-by: Jiawei Shao <jiawei.shao@intel.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: czekun <chen.zekun@intel.com>
Co-authored-by: Ryan Metcalfe <ryan.metcalfe@intel.com>
Co-authored-by: Jaskaran Singh Nagi <jaskaran.singh.nagi@intel.com>
Co-authored-by: ai-fw-intg <sys_ai_fw_intg@intel.com>
Co-authored-by: Rajeev Sekar <rajeev.sekar@intel.com>
Co-authored-by: RajeevSekar <117911837+RajeevSekar@users.noreply.github.com>
Co-authored-by: Nazanin Beheshti <nazanin.beheshti@intel.com>
### Description
Fix CUDA 13 build failure introduced by PR #28198 (commit 997c479).

**Root cause:** `gqa_unfused_attention.cu` directly includes
`<cub/cub.cuh>`, which on CUDA 13.x transitively pulls in CCCL's
`tcgen05_ld.h`. That header uses `__out` as a parameter name in inline
PTX assembly, but on Windows MSVC the SAL annotation macro `#define
__out` expands it, turning `__out[0]` into `[0]` — causing a parse
error.

**Fix:** Changed `#include <cub/cub.cuh>` to `#include
"core/providers/cuda/cu_inc/cub.cuh"`, the existing ORT wrapper that
applies `#undef __out` before including CUB.

### Motivation and Context
The CUDA 13 packaging pipeline (`py-cuda13-packaging-pipeline.yml`) has
been failing since PR #28198 was merged, with errors like:
```
E:/_work/_temp/v13.0/include/cccl/cuda/__ptx/instructions/generated/tcgen05_ld.h(20): error : expected an identifier
      asm("tcgen05.ld.sync.aligned.16x64b.x1.b32 {%0}, [%1];" : "=r"( [0]) : "r"(__taddr) : "memory");
```
Co-authored-by: Copilot <copilot@github.com>
@tianleiwu tianleiwu requested a review from Copilot May 3, 2026 06:37
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated no new comments.


Comment thread plugin-ep-cuda/python/test/test_cuda_plugin_ep.py Fixed
tianleiwu and others added 2 commits May 2, 2026 23:54
…ith 'import' and 'import from''

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
9 participants