Obtain temp storage size and alignment directly from LTO IR via PTX conversion. #5355

Merged
tpn merged 1 commit into NVIDIA:main from tpn:5346-add-lto-ptx-opt-for-cuda-coop
Sep 3, 2025
Conversation

@tpn tpn commented Jul 24, 2025

Contributor

This obviates the need for a much more expensive separate PTX compilation step just to get the same information. This reduces the overhead of a given primitive call by over half. (test_block_exchange.py went from 1m 23s to about 33s with this change in place, for example.)
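
As a rough illustration of the idea only (this is not the code from this PR), the sketch below reads a size and an alignment out of PTX text with a regular expression. It assumes the two values show up in the PTX as initialized `.u32` globals. The symbol names `temp_storage_bytes` and `temp_storage_alignment`, the sample PTX, and both helper functions are hypothetical.

```python
import re
from typing import Dict, Tuple

# Hypothetical PTX excerpt. The symbol names and values are illustrative
# assumptions, not output captured from this PR.
SAMPLE_PTX = """
.version 8.5
.target sm_90
.address_size 64

.visible .global .align 4 .u32 temp_storage_bytes = 4224;
.visible .global .align 4 .u32 temp_storage_alignment = 16;
"""

# Matches initialized 32-bit unsigned globals such as
#   .global .align 4 .u32 name = 123;
_GLOBAL_U32 = re.compile(
    r"\.global\s+\.align\s+\d+\s+\.u32\s+(?P<name>\w+)\s*=\s*(?P<value>\d+)\s*;"
)


def read_u32_globals(ptx: str) -> Dict[str, int]:
    """Return every initialized .u32 global found in the PTX text."""
    return {m.group("name"): int(m.group("value")) for m in _GLOBAL_U32.finditer(ptx)}


def temp_storage_layout(
    ptx: str,
    size_symbol: str = "temp_storage_bytes",
    align_symbol: str = "temp_storage_alignment",
) -> Tuple[int, int]:
    """Return (size, alignment) read from the two named globals."""
    values = read_u32_globals(ptx)
    missing = [s for s in (size_symbol, align_symbol) if s not in values]
    if missing:
        raise KeyError(f"symbols not found in PTX: {missing}")
    return values[size_symbol], values[align_symbol]


size, alignment = temp_storage_layout(SAMPLE_PTX)
```

Parsing text like this is cheap compared with a compilation step, which is the kind of saving described above, though the real implementation may extract the values differently.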

@tpn tpn self-assigned this Jul 24, 2025
@tpn tpn requested a review from a team as a code owner July 24, 2025 00:10
@tpn tpn requested a review from shwina July 24, 2025 00:10
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jul 24, 2025
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jul 24, 2025
@github-actions

Contributor
🟨 CI finished in 37m 29s: Pass: 77%/22 | Total: 2h 58m | Avg: 8m 07s | Max: 21m 09s
  • 🟨 python: Pass: 77%/22 | Total: 2h 58m | Avg: 8m 07s | Max: 21m 09s

    🚨 jobs: Test cuda.cccl.cooperative 🚨
      🟩 Build cuda.cccl    Pass: 100%/2   | Total: 19m 12s | Avg:  9m 36s | Max:  9m 46s
      🔥 Test cuda.cccl.cooperative Pass:   0%/5   | Total: 20m 42s | Avg:  4m 08s | Max:  5m 02s
      🟩 Test cuda.cccl.examples Pass: 100%/5   | Total: 21m 43s | Avg:  4m 20s | Max:  5m 26s
      🟩 Test cuda.cccl.headers Pass: 100%/5   | Total: 20m 36s | Avg:  4m 07s | Max:  5m 07s
      🟩 Test cuda.cccl.parallel Pass: 100%/5   | Total:  1h 36m | Avg: 19m 18s | Max: 21m 09s
    🟨 ctk
      🟨 12.5               Pass:  66%/6   | Total: 22m 35s | Avg:  3m 45s | Max:  3m 50s
      🟩 12.8               Pass: 100%/2   | Total: 38m 35s | Avg: 19m 17s | Max: 19m 23s
      🟨 12.9               Pass:  78%/14  | Total:  1h 57m | Avg:  8m 23s | Max: 21m 09s
    🟨 cudacxx
      🟨 nvcc12.5           Pass:  66%/6   | Total: 22m 35s | Avg:  3m 45s | Max:  3m 50s
      🟩 nvcc12.8           Pass: 100%/2   | Total: 38m 35s | Avg: 19m 17s | Max: 19m 23s
      🟨 nvcc12.9           Pass:  78%/14  | Total:  1h 57m | Avg:  8m 23s | Max: 21m 09s
    🟨 cpu
      🟨 amd64              Pass:  77%/22  | Total:  2h 58m | Avg:  8m 07s | Max: 21m 09s
    🟨 cudacxx_family
      🟨 nvcc               Pass:  77%/22  | Total:  2h 58m | Avg:  8m 07s | Max: 21m 09s
    🟨 cxx
      🟨 GCC13              Pass:  77%/22  | Total:  2h 58m | Avg:  8m 07s | Max: 21m 09s
    🟨 cxx_family
      🟨 GCC                Pass:  77%/22  | Total:  2h 58m | Avg:  8m 07s | Max: 21m 09s
    🟨 gpu
      🟨 h100               Pass:  75%/4   | Total: 36m 44s | Avg:  9m 11s | Max: 21m 09s
      🟨 l4                 Pass:  77%/18  | Total:  2h 21m | Avg:  7m 53s | Max: 19m 23s
    🟨 py_version
      🟨 3.10               Pass:  77%/9   | Total:  1h 11m | Avg:  7m 55s | Max: 19m 12s
      🟨 3.13               Pass:  76%/13  | Total:  1h 47m | Avg:  8m 15s | Max: 21m 09s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

🏃‍ Runner counts (total jobs: 22)

# Runner
16 linux-amd64-gpu-l4-latest-1
4 linux-amd64-gpu-h100-latest-1
2 linux-amd64-cpu16

isVoid commented Jul 24, 2025


@tpn I don't think this PR depends on NVIDIA/numba-cuda#326. You are directly using Numba's linker binding to get the PTX from LTO IR, which was introduced in the cuda-python linker work in NVIDIA/numba-cuda#133 and released in 0.16.0.

tpn commented Jul 24, 2025

Contributor Author

Ah looks like the issue is lack of LTOIR:

/home/coder/.pyenv/versions/3.10.18/lib/python3.10/site-packages/cuda/cccl/cooperative/experimental/block/_block_exchange.py:68: in <module>
    from .._types import (
/home/coder/.pyenv/versions/3.10.18/lib/python3.10/site-packages/cuda/cccl/cooperative/experimental/_types.py:17: in <module>
    from numba.cuda import LTOIR
E   ImportError: cannot import name 'LTOIR' from 'numba.cuda' (/home/coder/.pyenv/versions/3.10.18/lib/python3.10/site-packages/numba/cuda/__init__.py)

Which... looks like it was added 9 months ago. Weird.

tpn commented Jul 24, 2025

Contributor Author

Turns out we weren't even using numba-cuda in CI! Just old-school numba.cuda. Added it as an explicit dependency to pyproject.toml.
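
A missing dependency like this could be surfaced earlier with a small startup check. The following is only a sketch: `module_exports` and `require_export` are hypothetical helpers, not part of numba, numba-cuda, or this PR. It is exercised against the standard library so it runs without CUDA installed.

```python
import importlib


def module_exports(module_name: str, attribute: str) -> bool:
    """Return True if the module can be imported and exposes the attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attribute)


def require_export(module_name: str, attribute: str, hint: str) -> None:
    """Raise ImportError with an actionable hint when the attribute is missing."""
    if not module_exports(module_name, attribute):
        raise ImportError(
            f"{module_name!r} does not provide {attribute!r}; {hint}"
        )


# Intended use in this context (not executed here, since it needs numba):
#   require_export("numba.cuda", "LTOIR", "install the numba-cuda package")
stdlib_ok = module_exports("math", "sqrt")
```

The check reports the same condition as the traceback above, but with a message that points at the missing package instead of the failing import line.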

@tpn tpn requested a review from a team as a code owner July 24, 2025 21:36
@tpn tpn requested a review from fbusato July 24, 2025 21:36
@tpn tpn requested a review from a team as a code owner July 24, 2025 21:38
@github-actions

Contributor
🟨 CI finished in 1h 22m: Pass: 99%/161 | Total: 1d 09h | Avg: 12m 28s | Max: 1h 20m | Hits: 97%/152388
  • 🟨 cub: Pass: 98%/50 | Total: 15h 50m | Avg: 19m 01s | Max: 1h 20m | Hits: 97%/51848

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  97%/48  | Total: 15h 36m | Avg: 19m 30s | Max:  1h 20m | Hits:  97%/49322 
      🟩 arm64              Pass: 100%/2   | Total: 14m 31s | Avg:  7m 15s | Max:  8m 33s | Hits:  99%/2526  
    🔍 ctk: 12.9 🔍
      🟩 12.0               Pass: 100%/5   | Total:  1h 35m | Avg: 19m 09s | Max:  1h 05m | Hits:  97%/6211  
      🔍 12.9               Pass:  97%/45  | Total: 14h 15m | Avg: 19m 00s | Max:  1h 20m | Hits:  97%/45637 
    🔍 cudacxx: nvcc12.9 🔍
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  5m 40s | Hits:  99%/2175  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 35m | Avg: 19m 09s | Max:  1h 05m | Hits:  97%/6211  
      🔍 nvcc12.9           Pass:  97%/43  | Total: 14h 03m | Avg: 19m 37s | Max:  1h 20m | Hits:  97%/43462 
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 11m 10s | Avg:  5m 35s | Max:  5m 40s | Hits:  99%/2175  
      🔍 nvcc               Pass:  97%/48  | Total: 15h 39m | Avg: 19m 34s | Max:  1h 20m | Hits:  97%/49673 
    🔍 cxx: GCC13 🔍
      🟩 Clang14            Pass: 100%/4   | Total: 27m 18s | Avg:  6m 49s | Max:  7m 19s | Hits:  99%/5054  
      🟩 Clang15            Pass: 100%/2   | Total: 14m 22s | Avg:  7m 11s | Max:  7m 27s | Hits:  99%/2523  
      🟩 Clang16            Pass: 100%/2   | Total: 14m 27s | Avg:  7m 13s | Max:  7m 26s | Hits:  99%/2523  
      🟩 Clang17            Pass: 100%/2   | Total: 13m 39s | Avg:  6m 49s | Max:  7m 03s | Hits:  99%/2523  
      🟩 Clang18            Pass: 100%/2   | Total: 13m 30s | Avg:  6m 45s | Max:  6m 55s | Hits:  99%/2523  
      🟩 Clang19            Pass: 100%/7   | Total:  1h 17m | Avg: 11m 01s | Max: 24m 17s | Hits:  99%/5960  
      🟩 GCC7               Pass: 100%/2   | Total: 16m 40s | Avg:  8m 20s | Max:  8m 33s | Hits:  99%/2526  
      🟩 GCC8               Pass: 100%/1   | Total:  9m 18s | Avg:  9m 18s | Max:  9m 18s | Hits:  99%/1263  
      🟩 GCC9               Pass: 100%/2   | Total: 18m 30s | Avg:  9m 15s | Max:  9m 31s | Hits:  99%/2526  
      🟩 GCC10              Pass: 100%/2   | Total: 18m 25s | Avg:  9m 12s | Max:  9m 30s | Hits:  99%/2527  
      🟩 GCC11              Pass: 100%/2   | Total: 17m 33s | Avg:  8m 46s | Max:  8m 47s | Hits:  99%/2523  
      🟩 GCC12              Pass: 100%/2   | Total: 19m 13s | Avg:  9m 36s | Max:  9m 58s | Hits:  99%/2523  
      🔍 GCC13              Pass:  91%/12  | Total:  3h 13m | Avg: 16m 07s | Max: 30m 55s | Hits:  99%/7581  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 11m | Hits:  87%/2316  
      🟩 MSVC14.43          Pass: 100%/4   | Total:  4h 17m | Avg:  1h 04m | Max:  1h 20m | Hits:  87%/4632  
      🟩 NVHPC25.5          Pass: 100%/2   | Total:  1h 41m | Avg: 50m 56s | Max: 51m 29s | Hits:  86%/2325  
    🔍 cxx_family: GCC 🔍
      🟩 Clang              Pass: 100%/19  | Total:  2h 40m | Avg:  8m 26s | Max: 24m 17s | Hits:  99%/21106 
      🔍 GCC                Pass:  95%/23  | Total:  4h 53m | Avg: 12m 44s | Max: 30m 55s | Hits:  99%/21469 
      🟩 MSVC               Pass: 100%/6   | Total:  6h 35m | Avg:  1h 05m | Max:  1h 20m | Hits:  87%/6948  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 41m | Avg: 50m 56s | Max: 51m 29s | Hits:  86%/2325  
    🔍 gpu: rtxa6000 🔍
      🟩 h100               Pass: 100%/3   | Total:  1h 00m | Avg: 20m 07s | Max: 30m 55s | Hits:  99%/1264  
      🟩 rtx2080            Pass: 100%/39  | Total: 12h 18m | Avg: 18m 56s | Max:  1h 20m | Hits:  97%/48058 
      🔍 rtxa6000           Pass:  87%/8   | Total:  2h 31m | Avg: 18m 57s | Max: 25m 05s | Hits:  99%/2526  
    🔍 jobs: HostLaunch 🔍
      🟩 Build              Pass: 100%/42  | Total: 12h 42m | Avg: 18m 09s | Max:  1h 20m | Hits:  97%/51848 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 11s | Avg: 24m 11s | Max: 24m 11s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 54s | Avg: 15m 54s | Max: 15m 54s
      🔍 HostLaunch         Pass:  66%/3   | Total:  1h 20m | Avg: 26m 45s | Max: 30m 55s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 07m | Avg: 22m 33s | Max: 23m 36s
    🔍 std: 20 🔍
      🟩 17                 Pass: 100%/21  | Total:  6h 37m | Avg: 18m 56s | Max:  1h 13m | Hits:  97%/25915 
      🔍 20                 Pass:  96%/29  | Total:  9h 13m | Avg: 19m 04s | Max:  1h 20m | Hits:  97%/25933 
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total:  1h 00m | Avg: 20m 07s | Max: 30m 55s | Hits:  99%/1264  
      🟩 90;90a             Pass: 100%/2   | Total: 57m 44s | Avg: 28m 52s | Max: 49m 58s | Hits:  93%/2422  
      🟩 100;120            Pass: 100%/2   | Total:  1h 01m | Avg: 30m 40s | Max: 53m 02s | Hits:  93%/2422  
    
  • 🟩 thrust: Pass: 100%/50 | Total: 10h 39m | Avg: 12m 47s | Max: 42m 53s | Hits: 97%/84139

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 14m 03s | Avg:  7m 01s | Max:  8m 35s | Hits:  99%/1914  
    🟩 cpu
      🟩 amd64              Pass: 100%/48  | Total: 10h 27m | Avg: 13m 04s | Max: 42m 53s | Hits:  97%/80312 
      🟩 arm64              Pass: 100%/2   | Total: 11m 57s | Avg:  5m 58s | Max:  6m 55s | Hits:  99%/3827  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  1h 03m | Avg: 12m 45s | Max: 34m 20s | Hits:  94%/9560  
      🟩 12.9               Pass: 100%/45  | Total:  9h 35m | Avg: 12m 47s | Max: 42m 53s | Hits:  98%/74579 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 10m 56s | Avg:  5m 28s | Max:  5m 30s | Hits: 100%/3826  
      🟩 nvcc12.0           Pass: 100%/5   | Total:  1h 03m | Avg: 12m 45s | Max: 34m 20s | Hits:  94%/9560  
      🟩 nvcc12.9           Pass: 100%/43  | Total:  9h 24m | Avg: 13m 08s | Max: 42m 53s | Hits:  98%/70753 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 56s | Avg:  5m 28s | Max:  5m 30s | Hits: 100%/3826  
      🟩 nvcc               Pass: 100%/48  | Total: 10h 28m | Avg: 13m 05s | Max: 42m 53s | Hits:  97%/80313 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 22m 49s | Avg:  5m 42s | Max:  6m 19s | Hits: 100%/7652  
      🟩 Clang15            Pass: 100%/2   | Total: 12m 24s | Avg:  6m 12s | Max:  6m 30s | Hits: 100%/3826  
      🟩 Clang16            Pass: 100%/2   | Total: 11m 58s | Avg:  5m 59s | Max:  6m 05s | Hits: 100%/3826  
      🟩 Clang17            Pass: 100%/2   | Total: 12m 42s | Avg:  6m 21s | Max:  6m 29s | Hits: 100%/3826  
      🟩 Clang18            Pass: 100%/2   | Total: 11m 54s | Avg:  5m 57s | Max:  6m 03s | Hits: 100%/3826  
      🟩 Clang19            Pass: 100%/7   | Total: 38m 16s | Avg:  5m 28s | Max:  6m 23s | Hits: 100%/9565  
      🟩 GCC7               Pass: 100%/2   | Total: 18m 45s | Avg:  9m 22s | Max: 11m 26s | Hits:  86%/3828  
      🟩 GCC8               Pass: 100%/1   | Total: 42m 17s | Avg: 42m 17s | Max: 42m 17s | Hits:  51%/1914  
      🟩 GCC9               Pass: 100%/2   | Total: 15m 36s | Avg:  7m 48s | Max:  8m 06s | Hits:  99%/3828  
      🟩 GCC10              Pass: 100%/2   | Total: 14m 29s | Avg:  7m 14s | Max:  7m 16s | Hits:  99%/3828  
      🟩 GCC11              Pass: 100%/2   | Total: 15m 10s | Avg:  7m 35s | Max:  7m 37s | Hits:  99%/3828  
      🟩 GCC12              Pass: 100%/2   | Total: 16m 32s | Avg:  8m 16s | Max:  8m 34s | Hits:  99%/3828  
      🟩 GCC13              Pass: 100%/11  | Total:  1h 15m | Avg:  6m 51s | Max:  9m 10s | Hits:  99%/13398 
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 13m | Avg: 36m 46s | Max: 39m 12s | Hits:  98%/3812  
      🟩 MSVC14.43          Pass: 100%/5   | Total:  2h 57m | Avg: 35m 29s | Max: 42m 53s | Hits:  98%/9530  
      🟩 NVHPC25.5          Pass: 100%/2   | Total:  1h 20m | Avg: 40m 09s | Max: 41m 56s | Hits:  94%/3824  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 50m | Avg:  5m 47s | Max:  6m 30s | Hits: 100%/32521 
      🟩 GCC                Pass: 100%/22  | Total:  3h 18m | Avg:  9m 00s | Max: 42m 17s | Hits:  95%/34452 
      🟩 MSVC               Pass: 100%/7   | Total:  4h 10m | Avg: 35m 51s | Max: 42m 53s | Hits:  98%/13342 
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 20m | Avg: 40m 09s | Max: 41m 56s | Hits:  94%/3824  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 13m 12s | Avg:  6m 36s | Max:  7m 12s | Hits:  99%/1914  
      🟩 rtx2080            Pass: 100%/38  | Total:  8h 20m | Avg: 13m 10s | Max: 42m 17s | Hits:  97%/72672 
      🟩 rtx4090            Pass: 100%/10  | Total:  2h 05m | Avg: 12m 34s | Max: 42m 53s | Hits:  99%/9553  
    🟩 jobs
      🟩 Build              Pass: 100%/43  | Total:  9h 33m | Avg: 13m 20s | Max: 42m 53s | Hits:  97%/82233 
      🟩 TestCPU            Pass: 100%/3   | Total: 41m 33s | Avg: 13m 51s | Max: 33m 43s | Hits:  99%/1906  
      🟩 TestGPU            Pass: 100%/4   | Total: 24m 25s | Avg:  6m 06s | Max:  7m 12s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 13m 12s | Avg:  6m 36s | Max:  7m 12s | Hits:  99%/1914  
      🟩 90;90a             Pass: 100%/2   | Total: 36m 45s | Avg: 18m 22s | Max: 30m 10s | Hits:  99%/3820  
      🟩 100;120            Pass: 100%/2   | Total: 39m 25s | Avg: 19m 42s | Max: 31m 58s | Hits:  99%/3820  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total:  5h 09m | Avg: 14m 43s | Max: 42m 17s | Hits:  95%/40160 
      🟩 20                 Pass: 100%/27  | Total:  5h 16m | Avg: 11m 43s | Max: 42m 53s | Hits:  99%/42065 
    
  • 🟩 cudax: Pass: 100%/28 | Total: 2h 33m | Avg: 5m 28s | Max: 12m 19s | Hits: 99%/15906

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  2h 21m | Avg:  5m 53s | Max: 12m 19s | Hits:  99%/13462 
      🟩 arm64              Pass: 100%/4   | Total: 11m 54s | Avg:  2m 58s | Max:  3m 19s | Hits:  99%/2444  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 17m 31s | Avg:  5m 50s | Max: 11m 06s | Hits:  99%/1531  
      🟩 12.9               Pass: 100%/25  | Total:  2h 15m | Avg:  5m 25s | Max: 12m 19s | Hits:  99%/14375 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 17m 31s | Avg:  5m 50s | Max: 11m 06s | Hits:  99%/1531  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  2h 15m | Avg:  5m 25s | Max: 12m 19s | Hits:  99%/14375 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  2h 33m | Avg:  5m 28s | Max: 12m 19s | Hits:  99%/15906 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total:  6m 14s | Avg:  3m 07s | Max:  3m 18s | Hits: 100%/1224  
      🟩 Clang15            Pass: 100%/1   | Total:  3m 19s | Avg:  3m 19s | Max:  3m 19s | Hits: 100%/611   
      🟩 Clang16            Pass: 100%/1   | Total:  3m 22s | Avg:  3m 22s | Max:  3m 22s | Hits: 100%/611   
      🟩 Clang17            Pass: 100%/1   | Total:  3m 28s | Avg:  3m 28s | Max:  3m 28s | Hits: 100%/611   
      🟩 Clang18            Pass: 100%/1   | Total:  3m 24s | Avg:  3m 24s | Max:  3m 24s | Hits: 100%/611   
      🟩 Clang19            Pass: 100%/4   | Total: 17m 10s | Avg:  4m 17s | Max:  8m 24s | Hits: 100%/2444  
      🟩 GCC10              Pass: 100%/2   | Total:  7m 04s | Avg:  3m 32s | Max:  3m 35s | Hits:  99%/1224  
      🟩 GCC11              Pass: 100%/1   | Total:  3m 54s | Avg:  3m 54s | Max:  3m 54s | Hits:  99%/611   
      🟩 GCC12              Pass: 100%/1   | Total:  3m 47s | Avg:  3m 47s | Max:  3m 47s | Hits:  99%/611   
      🟩 GCC13              Pass: 100%/8   | Total: 39m 05s | Avg:  4m 53s | Max: 10m 25s | Hits:  99%/4888  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 06s | Avg: 11m 06s | Max: 11m 06s | Hits:  95%/309   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 34m 57s | Avg: 11m 39s | Max: 12m 19s | Hits:  95%/933   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 16m 20s | Avg:  8m 10s | Max:  8m 24s | Hits:  97%/1218  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total: 36m 57s | Avg:  3m 41s | Max:  8m 24s | Hits: 100%/6112  
      🟩 GCC                Pass: 100%/12  | Total: 53m 50s | Avg:  4m 29s | Max: 10m 25s | Hits:  99%/7334  
      🟩 MSVC               Pass: 100%/4   | Total: 46m 03s | Avg: 11m 30s | Max: 12m 19s | Hits:  95%/1242  
      🟩 NVHPC              Pass: 100%/2   | Total: 16m 20s | Avg:  8m 10s | Max:  8m 24s | Hits:  97%/1218  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 10m 48s | Avg:  5m 24s | Max:  7m 23s | Hits:  99%/1222  
      🟩 rtx2080            Pass: 100%/26  | Total:  2h 22m | Avg:  5m 28s | Max: 12m 19s | Hits:  99%/14684 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  2h 06m | Avg:  5m 04s | Max: 12m 19s | Hits:  99%/14073 
      🟩 Test               Pass: 100%/3   | Total: 26m 12s | Avg:  8m 44s | Max: 10m 25s | Hits:  99%/1833  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 10m 48s | Avg:  5m 24s | Max:  7m 23s | Hits:  99%/1222  
      🟩 90;90a             Pass: 100%/2   | Total: 14m 34s | Avg:  7m 17s | Max: 10m 55s | Hits:  98%/922   
      🟩 100;120            Pass: 100%/2   | Total: 15m 22s | Avg:  7m 41s | Max: 11m 43s | Hits:  98%/922   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 14m 18s | Avg:  4m 46s | Max:  8m 24s | Hits:  98%/1831  
      🟩 20                 Pass: 100%/25  | Total:  2h 18m | Avg:  5m 33s | Max: 12m 19s | Hits:  99%/14075 
    
  • 🟩 python: Pass: 100%/22 | Total: 3h 30m | Avg: 9m 35s | Max: 21m 01s

    🟩 cpu
      🟩 amd64              Pass: 100%/22  | Total:  3h 30m | Avg:  9m 35s | Max: 21m 01s
    🟩 ctk
      🟩 12.5               Pass: 100%/6   | Total: 36m 14s | Avg:  6m 02s | Max: 10m 10s
      🟩 12.8               Pass: 100%/2   | Total: 37m 59s | Avg: 18m 59s | Max: 19m 01s
      🟩 12.9               Pass: 100%/14  | Total:  2h 16m | Avg:  9m 45s | Max: 21m 01s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/6   | Total: 36m 14s | Avg:  6m 02s | Max: 10m 10s
      🟩 nvcc12.8           Pass: 100%/2   | Total: 37m 59s | Avg: 18m 59s | Max: 19m 01s
      🟩 nvcc12.9           Pass: 100%/14  | Total:  2h 16m | Avg:  9m 45s | Max: 21m 01s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/22  | Total:  3h 30m | Avg:  9m 35s | Max: 21m 01s
    🟩 cxx
      🟩 GCC13              Pass: 100%/22  | Total:  3h 30m | Avg:  9m 35s | Max: 21m 01s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/22  | Total:  3h 30m | Avg:  9m 35s | Max: 21m 01s
    🟩 gpu
      🟩 h100               Pass: 100%/4   | Total: 42m 20s | Avg: 10m 35s | Max: 21m 01s
      🟩 l4                 Pass: 100%/18  | Total:  2h 48m | Avg:  9m 21s | Max: 19m 23s
    🟩 jobs
      🟩 Build cuda.cccl    Pass: 100%/2   | Total: 18m 51s | Avg:  9m 25s | Max:  9m 28s
      🟩 Test cuda.cccl.cooperative Pass: 100%/5   | Total: 52m 26s | Avg: 10m 29s | Max: 12m 03s
      🟩 Test cuda.cccl.examples Pass: 100%/5   | Total: 21m 18s | Avg:  4m 15s | Max:  4m 43s
      🟩 Test cuda.cccl.headers Pass: 100%/5   | Total: 21m 16s | Avg:  4m 15s | Max:  5m 12s
      🟩 Test cuda.cccl.parallel Pass: 100%/5   | Total:  1h 37m | Avg: 19m 24s | Max: 21m 01s
    🟩 py_version
      🟩 3.10               Pass: 100%/9   | Total:  1h 25m | Avg:  9m 28s | Max: 19m 23s
      🟩 3.13               Pass: 100%/13  | Total:  2h 05m | Avg:  9m 40s | Max: 21m 01s
    
  • 🟩 packaging: Pass: 100%/4 | Total: 11m 44s | Avg: 2m 56s | Max: 3m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 11m 44s | Avg:  2m 56s | Max:  3m 23s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  5m 04s | Avg:  2m 32s | Max:  2m 46s
      🟩 12.9               Pass: 100%/2   | Total:  6m 40s | Avg:  3m 20s | Max:  3m 23s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  5m 04s | Avg:  2m 32s | Max:  2m 46s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  6m 40s | Avg:  3m 20s | Max:  3m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 11m 44s | Avg:  2m 56s | Max:  3m 23s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  2m 46s | Avg:  2m 46s | Max:  2m 46s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 17s | Avg:  3m 17s | Max:  3m 17s
      🟩 GCC12              Pass: 100%/1   | Total:  2m 18s | Avg:  2m 18s | Max:  2m 18s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  6m 03s | Avg:  3m 01s | Max:  3m 17s
      🟩 GCC                Pass: 100%/2   | Total:  5m 41s | Avg:  2m 50s | Max:  3m 23s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 11m 44s | Avg:  2m 56s | Max:  3m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 11m 44s | Avg:  2m 56s | Max:  3m 23s
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 17m 02s | Avg: 4m 15s | Max: 4m 33s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 53s | Avg:  4m 26s | Max:  4m 33s
      🟩 arm64              Pass: 100%/2   | Total:  8m 09s | Avg:  4m 04s | Max:  4m 19s
    🟩 ctk
      🟩 12.9               Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 cxx
      🟩 NVHPC25.5          Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 17m 02s | Avg:  4m 15s | Max:  4m 33s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total:  8m 39s | Avg:  4m 19s | Max:  4m 20s
      🟩 20                 Pass: 100%/2   | Total:  8m 23s | Avg:  4m 11s | Max:  4m 33s
    
  • 🟩 cccl_c_parallel: Pass: 100%/3 | Total: 24m 42s | Avg: 8m 14s | Max: 12m 16s | Hits: 98%/495

    🟩 cpu
      🟩 amd64              Pass: 100%/3   | Total: 24m 42s | Avg:  8m 14s | Max: 12m 16s | Hits:  98%/495   
    🟩 ctk
      🟩 12.9               Pass: 100%/3   | Total: 24m 42s | Avg:  8m 14s | Max: 12m 16s | Hits:  98%/495   
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/3   | Total: 24m 42s | Avg:  8m 14s | Max: 12m 16s | Hits:  98%/495   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/3   | Total: 24m 42s | Avg:  8m 14s | Max: 12m 16s | Hits:  98%/495   
    🟩 cxx
      🟩 GCC13              Pass: 100%/3   | Total: 24m 42s | Avg:  8m 14s | Max: 12m 16s | Hits:  98%/495   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/3   | Total: 24m 42s | Avg:  8m 14s | Max: 12m 16s | Hits:  98%/495   
    🟩 gpu
      🟩 h100               Pass: 100%/1   | Total: 12m 16s | Avg: 12m 16s | Max: 12m 16s | Hits:  98%/165   
      🟩 rtx2080            Pass: 100%/2   | Total: 12m 26s | Avg:  6m 13s | Max: 10m 22s | Hits:  98%/330   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 04s | Avg:  2m 04s | Max:  2m 04s | Hits:  98%/165   
      🟩 Test               Pass: 100%/2   | Total: 22m 38s | Avg: 11m 19s | Max: 12m 16s | Hits:  98%/330   
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
+/- CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- CCCL Packaging
libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- stdpar
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 161)

# Runner
93 linux-amd64-cpu16
17 windows-amd64-cpu16
16 linux-amd64-gpu-l4-latest-1
10 linux-arm64-cpu16
9 linux-amd64-gpu-h100-latest-1
7 linux-amd64-gpu-rtx2080-latest-1
6 linux-amd64-gpu-rtxa6000-latest-1
3 linux-amd64-gpu-rtx4090-latest-1

@tpn tpn force-pushed the 5346-add-lto-ptx-opt-for-cuda-coop branch from ef68243 to ebfe17c Compare July 25, 2025 17:40
@github-actions

Contributor
🟩 CI finished in 1h 07m: Pass: 100%/162 | Total: 1d 03h | Avg: 10m 05s | Max: 34m 00s | Hits: 99%/152553
  • 🟩 cub: Pass: 100%/50 | Total: 10h 55m | Avg: 13m 06s | Max: 33m 33s | Hits: 99%/51848

    🟩 cpu
      🟩 amd64              Pass: 100%/48  | Total: 10h 40m | Avg: 13m 20s | Max: 33m 33s | Hits:  99%/49322 
      🟩 arm64              Pass: 100%/2   | Total: 14m 30s | Avg:  7m 15s | Max:  8m 30s | Hits:  99%/2526  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 59m 30s | Avg: 11m 54s | Max: 29m 22s | Hits:  99%/6211  
      🟩 12.9               Pass: 100%/45  | Total:  9h 55m | Avg: 13m 14s | Max: 33m 33s | Hits:  99%/45637 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 10m 26s | Avg:  5m 13s | Max:  5m 17s | Hits:  99%/2175  
      🟩 nvcc12.0           Pass: 100%/5   | Total: 59m 30s | Avg: 11m 54s | Max: 29m 22s | Hits:  99%/6211  
      🟩 nvcc12.9           Pass: 100%/43  | Total:  9h 45m | Avg: 13m 36s | Max: 33m 33s | Hits:  99%/43462 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 10m 26s | Avg:  5m 13s | Max:  5m 17s | Hits:  99%/2175  
      🟩 nvcc               Pass: 100%/48  | Total: 10h 44m | Avg: 13m 25s | Max: 33m 33s | Hits:  99%/49673 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 27m 18s | Avg:  6m 49s | Max:  7m 18s | Hits:  99%/5054  
      🟩 Clang15            Pass: 100%/2   | Total: 13m 51s | Avg:  6m 55s | Max:  6m 59s | Hits:  99%/2523  
      🟩 Clang16            Pass: 100%/2   | Total: 13m 48s | Avg:  6m 54s | Max:  7m 05s | Hits:  99%/2523  
      🟩 Clang17            Pass: 100%/2   | Total: 13m 26s | Avg:  6m 43s | Max:  6m 44s | Hits:  99%/2523  
      🟩 Clang18            Pass: 100%/2   | Total: 13m 43s | Avg:  6m 51s | Max:  6m 56s | Hits:  99%/2523  
      🟩 Clang19            Pass: 100%/7   | Total:  1h 16m | Avg: 10m 52s | Max: 23m 57s | Hits:  99%/5960  
      🟩 GCC7               Pass: 100%/2   | Total: 17m 01s | Avg:  8m 30s | Max:  8m 38s | Hits:  99%/2526  
      🟩 GCC8               Pass: 100%/1   | Total:  9m 22s | Avg:  9m 22s | Max:  9m 22s | Hits:  99%/1263  
      🟩 GCC9               Pass: 100%/2   | Total: 17m 37s | Avg:  8m 48s | Max:  8m 52s | Hits:  99%/2526  
      🟩 GCC10              Pass: 100%/2   | Total: 18m 33s | Avg:  9m 16s | Max:  9m 29s | Hits:  99%/2527  
      🟩 GCC11              Pass: 100%/2   | Total: 18m 05s | Avg:  9m 02s | Max:  9m 23s | Hits:  99%/2523  
      🟩 GCC12              Pass: 100%/2   | Total: 18m 58s | Avg:  9m 29s | Max: 10m 00s | Hits:  99%/2523  
      🟩 GCC13              Pass: 100%/12  | Total:  3h 06m | Avg: 15m 30s | Max: 25m 54s | Hits:  99%/7581  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 00m | Avg: 30m 27s | Max: 31m 32s | Hits:  99%/2316  
      🟩 MSVC14.43          Pass: 100%/4   | Total:  2h 03m | Avg: 30m 59s | Max: 33m 33s | Hits:  99%/4632  
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 26m 18s | Avg: 13m 09s | Max: 13m 37s | Hits:  98%/2325  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  2h 38m | Avg:  8m 19s | Max: 23m 57s | Hits:  99%/21106 
      🟩 GCC                Pass: 100%/23  | Total:  4h 45m | Avg: 12m 25s | Max: 25m 54s | Hits:  99%/21469 
      🟩 MSVC               Pass: 100%/6   | Total:  3h 04m | Avg: 30m 48s | Max: 33m 33s | Hits:  99%/6948  
      🟩 NVHPC              Pass: 100%/2   | Total: 26m 18s | Avg: 13m 09s | Max: 13m 37s | Hits:  98%/2325  
    🟩 gpu
      🟩 h100               Pass: 100%/3   | Total: 52m 40s | Avg: 17m 33s | Max: 24m 17s | Hits:  99%/1264  
      🟩 rtx2080            Pass: 100%/39  | Total:  7h 30m | Avg: 11m 33s | Max: 33m 33s | Hits:  99%/48058 
      🟩 rtxa6000           Pass: 100%/8   | Total:  2h 31m | Avg: 18m 59s | Max: 25m 54s | Hits:  99%/2526  
    🟩 jobs
      🟩 Build              Pass: 100%/42  | Total:  7h 54m | Avg: 11m 17s | Max: 33m 33s | Hits:  99%/51848 
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 23m 59s | Avg: 23m 59s | Max: 23m 59s
      🟩 GraphCapture       Pass: 100%/1   | Total: 15m 20s | Avg: 15m 20s | Max: 15m 20s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 14m | Avg: 24m 42s | Max: 25m 54s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 07m | Avg: 22m 26s | Max: 23m 34s
    🟩 sm
      🟩 90                 Pass: 100%/3   | Total: 52m 40s | Avg: 17m 33s | Max: 24m 17s | Hits:  99%/1264  
      🟩 90;90a             Pass: 100%/2   | Total: 36m 55s | Avg: 18m 27s | Max: 29m 16s | Hits:  99%/2422  
      🟩 100;120            Pass: 100%/2   | Total: 39m 14s | Avg: 19m 37s | Max: 30m 24s | Hits:  99%/2422  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total:  3h 59m | Avg: 11m 23s | Max: 31m 32s | Hits:  99%/25915 
      🟩 20                 Pass: 100%/29  | Total:  6h 55m | Avg: 14m 20s | Max: 33m 33s | Hits:  99%/25933 
    
  • 🟩 thrust: Pass: 100%/50 | Total: 8h 52m | Avg: 10m 38s | Max: 34m 00s | Hits: 99%/84139

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 13m 52s | Avg:  6m 56s | Max:  8m 28s | Hits:  99%/1914  
    🟩 cpu
      🟩 amd64              Pass: 100%/48  | Total:  8h 40m | Avg: 10m 50s | Max: 34m 00s | Hits:  99%/80312 
      🟩 arm64              Pass: 100%/2   | Total: 11m 55s | Avg:  5m 57s | Max:  6m 48s | Hits:  99%/3827  
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total: 51m 24s | Avg: 10m 16s | Max: 26m 43s | Hits:  99%/9560  
      🟩 12.9               Pass: 100%/45  | Total:  8h 00m | Avg: 10m 40s | Max: 34m 00s | Hits:  99%/74579 
    🟩 cudacxx
      🟩 ClangCUDA19        Pass: 100%/2   | Total: 11m 41s | Avg:  5m 50s | Max:  6m 00s | Hits: 100%/3826  
      🟩 nvcc12.0           Pass: 100%/5   | Total: 51m 24s | Avg: 10m 16s | Max: 26m 43s | Hits:  99%/9560  
      🟩 nvcc12.9           Pass: 100%/43  | Total:  7h 48m | Avg: 10m 54s | Max: 34m 00s | Hits:  99%/70753 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 11m 41s | Avg:  5m 50s | Max:  6m 00s | Hits: 100%/3826  
      🟩 nvcc               Pass: 100%/48  | Total:  8h 40m | Avg: 10m 50s | Max: 34m 00s | Hits:  99%/80313 
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total: 23m 13s | Avg:  5m 48s | Max:  6m 27s | Hits: 100%/7652  
      🟩 Clang15            Pass: 100%/2   | Total: 12m 11s | Avg:  6m 05s | Max:  6m 16s | Hits: 100%/3826  
      🟩 Clang16            Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  6m 06s | Hits: 100%/3826  
      🟩 Clang17            Pass: 100%/2   | Total: 12m 04s | Avg:  6m 02s | Max:  6m 10s | Hits: 100%/3826  
      🟩 Clang18            Pass: 100%/2   | Total: 11m 40s | Avg:  5m 50s | Max:  5m 52s | Hits: 100%/3826  
      🟩 Clang19            Pass: 100%/7   | Total: 39m 49s | Avg:  5m 41s | Max:  6m 41s | Hits: 100%/9565  
      🟩 GCC7               Pass: 100%/2   | Total: 14m 25s | Avg:  7m 12s | Max:  7m 15s | Hits:  99%/3828  
      🟩 GCC8               Pass: 100%/1   | Total:  7m 14s | Avg:  7m 14s | Max:  7m 14s | Hits:  99%/1914  
      🟩 GCC9               Pass: 100%/2   | Total: 13m 59s | Avg:  6m 59s | Max:  7m 14s | Hits:  99%/3828  
      🟩 GCC10              Pass: 100%/2   | Total: 14m 44s | Avg:  7m 22s | Max:  7m 30s | Hits:  99%/3828  
      🟩 GCC11              Pass: 100%/2   | Total: 15m 15s | Avg:  7m 37s | Max:  7m 42s | Hits:  99%/3828  
      🟩 GCC12              Pass: 100%/2   | Total: 16m 28s | Avg:  8m 14s | Max:  8m 59s | Hits:  99%/3828  
      🟩 GCC13              Pass: 100%/11  | Total:  1h 14m | Avg:  6m 45s | Max:  8m 50s | Hits:  99%/13398 
      🟩 MSVC14.29          Pass: 100%/2   | Total: 56m 28s | Avg: 28m 14s | Max: 29m 45s | Hits:  99%/3812  
      🟩 MSVC14.43          Pass: 100%/5   | Total:  2h 24m | Avg: 28m 48s | Max: 34m 00s | Hits:  99%/9530  
      🟩 NVHPC25.5          Pass: 100%/2   | Total:  1h 04m | Avg: 32m 02s | Max: 32m 39s | Hits:  99%/3824  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/19  | Total:  1h 51m | Avg:  5m 50s | Max:  6m 41s | Hits: 100%/32521 
      🟩 GCC                Pass: 100%/22  | Total:  2h 36m | Avg:  7m 06s | Max:  8m 59s | Hits:  99%/34452 
      🟩 MSVC               Pass: 100%/7   | Total:  3h 20m | Avg: 28m 38s | Max: 34m 00s | Hits:  99%/13342 
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 04m | Avg: 32m 02s | Max: 32m 39s | Hits:  99%/3824  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 12m 58s | Avg:  6m 29s | Max:  7m 09s | Hits:  99%/1914  
      🟩 rtx2080            Pass: 100%/38  | Total:  6h 46m | Avg: 10m 41s | Max: 32m 39s | Hits:  99%/72672 
      🟩 rtx4090            Pass: 100%/10  | Total:  1h 52m | Avg: 11m 17s | Max: 34m 00s | Hits:  99%/9553  
    🟩 jobs
      🟩 Build              Pass: 100%/43  | Total:  7h 46m | Avg: 10m 50s | Max: 32m 39s | Hits:  99%/82233 
      🟩 TestCPU            Pass: 100%/3   | Total: 41m 58s | Avg: 13m 59s | Max: 34m 00s | Hits:  99%/1906  
      🟩 TestGPU            Pass: 100%/4   | Total: 24m 00s | Avg:  6m 00s | Max:  7m 09s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 12m 58s | Avg:  6m 29s | Max:  7m 09s | Hits:  99%/1914  
      🟩 90;90a             Pass: 100%/2   | Total: 34m 51s | Avg: 17m 25s | Max: 27m 38s | Hits:  99%/3820  
      🟩 100;120            Pass: 100%/2   | Total: 31m 05s | Avg: 15m 32s | Max: 23m 36s | Hits:  99%/3820  
    🟩 std
      🟩 17                 Pass: 100%/21  | Total:  3h 52m | Avg: 11m 05s | Max: 32m 39s | Hits:  99%/40160 
      🟩 20                 Pass: 100%/27  | Total:  4h 45m | Avg: 10m 34s | Max: 34m 00s | Hits:  99%/42065 
    
  • 🟩 cudax: Pass: 100%/28 | Total: 2h 35m | Avg: 5m 32s | Max: 11m 49s | Hits: 99%/15906

    🟩 cpu
      🟩 amd64              Pass: 100%/24  | Total:  2h 23m | Avg:  5m 57s | Max: 11m 49s | Hits:  99%/13462 
      🟩 arm64              Pass: 100%/4   | Total: 11m 57s | Avg:  2m 59s | Max:  3m 17s | Hits:  99%/2444  
    🟩 ctk
      🟩 12.0               Pass: 100%/3   | Total: 18m 00s | Avg:  6m 00s | Max: 11m 49s | Hits:  99%/1531  
      🟩 12.9               Pass: 100%/25  | Total:  2h 17m | Avg:  5m 28s | Max: 11m 47s | Hits:  99%/14375 
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/3   | Total: 18m 00s | Avg:  6m 00s | Max: 11m 49s | Hits:  99%/1531  
      🟩 nvcc12.9           Pass: 100%/25  | Total:  2h 17m | Avg:  5m 28s | Max: 11m 47s | Hits:  99%/14375 
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/28  | Total:  2h 35m | Avg:  5m 32s | Max: 11m 49s | Hits:  99%/15906 
    🟩 cxx
      🟩 Clang14            Pass: 100%/2   | Total:  6m 05s | Avg:  3m 02s | Max:  3m 13s | Hits: 100%/1224  
      🟩 Clang15            Pass: 100%/1   | Total:  3m 21s | Avg:  3m 21s | Max:  3m 21s | Hits: 100%/611   
      🟩 Clang16            Pass: 100%/1   | Total:  3m 23s | Avg:  3m 23s | Max:  3m 23s | Hits: 100%/611   
      🟩 Clang17            Pass: 100%/1   | Total:  3m 19s | Avg:  3m 19s | Max:  3m 19s | Hits: 100%/611   
      🟩 Clang18            Pass: 100%/1   | Total:  3m 17s | Avg:  3m 17s | Max:  3m 17s | Hits: 100%/611   
      🟩 Clang19            Pass: 100%/4   | Total: 18m 52s | Avg:  4m 43s | Max: 10m 14s | Hits: 100%/2444  
      🟩 GCC10              Pass: 100%/2   | Total:  6m 47s | Avg:  3m 23s | Max:  3m 28s | Hits:  99%/1224  
      🟩 GCC11              Pass: 100%/1   | Total:  3m 54s | Avg:  3m 54s | Max:  3m 54s | Hits:  99%/611   
      🟩 GCC12              Pass: 100%/1   | Total:  4m 01s | Avg:  4m 01s | Max:  4m 01s | Hits:  99%/611   
      🟩 GCC13              Pass: 100%/8   | Total: 40m 49s | Avg:  5m 06s | Max: 10m 59s | Hits:  99%/4888  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 49s | Avg: 11m 49s | Max: 11m 49s | Hits:  95%/309   
      🟩 MSVC14.43          Pass: 100%/3   | Total: 33m 58s | Avg: 11m 19s | Max: 11m 47s | Hits:  95%/933   
      🟩 NVHPC25.5          Pass: 100%/2   | Total: 15m 26s | Avg:  7m 43s | Max:  7m 56s | Hits:  97%/1218  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/10  | Total: 38m 17s | Avg:  3m 49s | Max: 10m 14s | Hits: 100%/6112  
      🟩 GCC                Pass: 100%/12  | Total: 55m 31s | Avg:  4m 37s | Max: 10m 59s | Hits:  99%/7334  
      🟩 MSVC               Pass: 100%/4   | Total: 45m 47s | Avg: 11m 26s | Max: 11m 49s | Hits:  95%/1242  
      🟩 NVHPC              Pass: 100%/2   | Total: 15m 26s | Avg:  7m 43s | Max:  7m 56s | Hits:  97%/1218  
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 11m 57s | Avg:  5m 58s | Max:  8m 50s | Hits:  99%/1222  
      🟩 rtx2080            Pass: 100%/26  | Total:  2h 23m | Avg:  5m 30s | Max: 11m 49s | Hits:  99%/14684 
    🟩 jobs
      🟩 Build              Pass: 100%/25  | Total:  2h 04m | Avg:  4m 59s | Max: 11m 49s | Hits:  99%/14073 
      🟩 Test               Pass: 100%/3   | Total: 30m 03s | Avg: 10m 01s | Max: 10m 59s | Hits:  99%/1833  
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 11m 57s | Avg:  5m 58s | Max:  8m 50s | Hits:  99%/1222  
      🟩 90;90a             Pass: 100%/2   | Total: 14m 31s | Avg:  7m 15s | Max: 10m 46s | Hits:  98%/922   
      🟩 100;120            Pass: 100%/2   | Total: 15m 18s | Avg:  7m 39s | Max: 11m 47s | Hits:  98%/922   
    🟩 std
      🟩 17                 Pass: 100%/3   | Total: 13m 29s | Avg:  4m 29s | Max:  7m 30s | Hits:  98%/1831  
      🟩 20                 Pass: 100%/25  | Total:  2h 21m | Avg:  5m 39s | Max: 11m 49s | Hits:  99%/14075 
    
  • 🟩 python: Pass: 100%/22 | Total: 3h 28m | Avg: 9m 29s | Max: 20m 56s

    🟩 cpu
      🟩 amd64              Pass: 100%/22  | Total:  3h 28m | Avg:  9m 29s | Max: 20m 56s
    🟩 ctk
      🟩 12.5               Pass: 100%/6   | Total: 35m 31s | Avg:  5m 55s | Max:  9m 52s
      🟩 12.8               Pass: 100%/2   | Total: 37m 23s | Avg: 18m 41s | Max: 18m 51s
      🟩 12.9               Pass: 100%/14  | Total:  2h 15m | Avg:  9m 42s | Max: 20m 56s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/6   | Total: 35m 31s | Avg:  5m 55s | Max:  9m 52s
      🟩 nvcc12.8           Pass: 100%/2   | Total: 37m 23s | Avg: 18m 41s | Max: 18m 51s
      🟩 nvcc12.9           Pass: 100%/14  | Total:  2h 15m | Avg:  9m 42s | Max: 20m 56s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/22  | Total:  3h 28m | Avg:  9m 29s | Max: 20m 56s
    🟩 cxx
      🟩 GCC13              Pass: 100%/22  | Total:  3h 28m | Avg:  9m 29s | Max: 20m 56s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/22  | Total:  3h 28m | Avg:  9m 29s | Max: 20m 56s
    🟩 gpu
      🟩 h100               Pass: 100%/4   | Total: 42m 23s | Avg: 10m 35s | Max: 20m 56s
      🟩 l4                 Pass: 100%/18  | Total:  2h 46m | Avg:  9m 14s | Max: 18m 51s
    🟩 jobs
      🟩 Build cuda.cccl    Pass: 100%/2   | Total: 19m 46s | Avg:  9m 53s | Max: 10m 26s
      🟩 Test cuda.cccl.cooperative Pass: 100%/5   | Total: 51m 33s | Avg: 10m 18s | Max: 11m 59s
      🟩 Test cuda.cccl.examples Pass: 100%/5   | Total: 22m 43s | Avg:  4m 32s | Max:  5m 42s
      🟩 Test cuda.cccl.headers Pass: 100%/5   | Total: 19m 10s | Avg:  3m 50s | Max:  3m 59s
      🟩 Test cuda.cccl.parallel Pass: 100%/5   | Total:  1h 35m | Avg: 19m 07s | Max: 20m 56s
    🟩 py_version
      🟩 3.10               Pass: 100%/9   | Total:  1h 24m | Avg:  9m 21s | Max: 18m 51s
      🟩 3.13               Pass: 100%/13  | Total:  2h 04m | Avg:  9m 34s | Max: 20m 56s
    
  • 🟩 cccl_c_parallel: Pass: 100%/4 | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits: 98%/660

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits:  98%/660   
    🟩 ctk
      🟩 12.9               Pass: 100%/4   | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits:  98%/660   
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/4   | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits:  98%/660   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits:  98%/660   
    🟩 cxx
      🟩 GCC13              Pass: 100%/4   | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits:  98%/660   
    🟩 cxx_family
      🟩 GCC                Pass: 100%/4   | Total: 54m 42s | Avg: 13m 40s | Max: 20m 17s | Hits:  98%/660   
    🟩 gpu
      🟩 h100               Pass: 100%/1   | Total: 20m 17s | Avg: 20m 17s | Max: 20m 17s | Hits:  98%/165   
      🟩 l4                 Pass: 100%/1   | Total: 17m 10s | Avg: 17m 10s | Max: 17m 10s | Hits:  98%/165   
      🟩 rtx2080            Pass: 100%/2   | Total: 17m 15s | Avg:  8m 37s | Max: 15m 11s | Hits:  98%/330   
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 04s | Avg:  2m 04s | Max:  2m 04s | Hits:  98%/165   
      🟩 Test               Pass: 100%/3   | Total: 52m 38s | Avg: 17m 32s | Max: 20m 17s | Hits:  98%/495   
    
  • 🟩 packaging: Pass: 100%/4 | Total: 12m 17s | Avg: 3m 04s | Max: 3m 31s

    🟩 cpu
      🟩 amd64              Pass: 100%/4   | Total: 12m 17s | Avg:  3m 04s | Max:  3m 31s
    🟩 ctk
      🟩 12.0               Pass: 100%/2   | Total:  5m 41s | Avg:  2m 50s | Max:  3m 18s
      🟩 12.9               Pass: 100%/2   | Total:  6m 36s | Avg:  3m 18s | Max:  3m 31s
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/2   | Total:  5m 41s | Avg:  2m 50s | Max:  3m 18s
      🟩 nvcc12.9           Pass: 100%/2   | Total:  6m 36s | Avg:  3m 18s | Max:  3m 31s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 12m 17s | Avg:  3m 04s | Max:  3m 31s
    🟩 cxx
      🟩 Clang14            Pass: 100%/1   | Total:  2m 23s | Avg:  2m 23s | Max:  2m 23s
      🟩 Clang19            Pass: 100%/1   | Total:  3m 31s | Avg:  3m 31s | Max:  3m 31s
      🟩 GCC12              Pass: 100%/1   | Total:  3m 18s | Avg:  3m 18s | Max:  3m 18s
      🟩 GCC13              Pass: 100%/1   | Total:  3m 05s | Avg:  3m 05s | Max:  3m 05s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/2   | Total:  5m 54s | Avg:  2m 57s | Max:  3m 31s
      🟩 GCC                Pass: 100%/2   | Total:  6m 23s | Avg:  3m 11s | Max:  3m 18s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 12m 17s | Avg:  3m 04s | Max:  3m 31s
    🟩 jobs
      🟩 Test               Pass: 100%/4   | Total: 12m 17s | Avg:  3m 04s | Max:  3m 31s
    
  • 🟩 stdpar: Pass: 100%/4 | Total: 15m 52s | Avg: 3m 58s | Max: 4m 11s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  8m 02s | Avg:  4m 01s | Max:  4m 11s
      🟩 arm64              Pass: 100%/2   | Total:  7m 50s | Avg:  3m 55s | Max:  4m 11s
    🟩 ctk
      🟩 12.9               Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 cudacxx
      🟩 nvcc12.9           Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 cxx
      🟩 NVHPC25.5          Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 cxx_family
      🟩 NVHPC              Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 gpu
      🟩 rtx2080            Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 jobs
      🟩 Build              Pass: 100%/4   | Total: 15m 52s | Avg:  3m 58s | Max:  4m 11s
    🟩 std
      🟩 17                 Pass: 100%/2   | Total:  7m 30s | Avg:  3m 45s | Max:  3m 51s
      🟩 20                 Pass: 100%/2   | Total:  8m 22s | Avg:  4m 11s | Max:  4m 11s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
CCCL Packaging
libcu++
+/- CUB
Thrust
CUDA Experimental
stdpar
+/- python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- CCCL Packaging
libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- stdpar
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 162)

# Runner
93 linux-amd64-cpu16
17 linux-amd64-gpu-l4-latest-1
17 windows-amd64-cpu16
10 linux-arm64-cpu16
9 linux-amd64-gpu-h100-latest-1
7 linux-amd64-gpu-rtx2080-latest-1
6 linux-amd64-gpu-rtxa6000-latest-1
3 linux-amd64-gpu-rtx4090-latest-1

Comment thread python/cuda_cccl/pyproject.toml Outdated
requires-python = ">=3.9"
dependencies = [
    "numba>=0.60.0",
    "numba-cuda>=0.16.0",

Contributor Author

Investigate: do we still need numba here? Can we just depend on numba-cuda bringing in the right version?

Contributor

Maybe let's add a note so it doesn't keep coming up: #5613 (comment)

…onversion.

This obviates the need for a much more expensive separate PTX compilation
step just to get the same information.  This reduces the overhead of a
given primitive call by over half.  (`test_block_exchange.py` went from
1m 23s to about 33s with this change in place, for example.)

N.B. Depends on a very recent numba-cuda change by @isVoid:
     NVIDIA/numba-cuda#326.  I don't think a
     branch has been cut with this change yet, so... we'll need to wait
     for that before we can pin an appropriate version and test this in
     CI.
@tpn tpn force-pushed the 5346-add-lto-ptx-opt-for-cuda-coop branch from ebfe17c to bfaeb24 Compare September 2, 2025 21:24
@github-actions

github-actions Bot commented Sep 3, 2025

Contributor

🥳 CI Workflow Results

🟩 Finished in 2h 39m: Pass: 100%/22 | Total: 3h 42m | Max: 21m 55s

See results here.

if self._temp_storage_alignment is None:
    raise RuntimeError(
        "Temporary storage alignment not computed yet. "
        "Call get_lto_ir() first."

Contributor

Question: why not call get_lto_ir() on the user's behalf instead of raising here?

Contributor Author

It's essentially an internal invariant... users wouldn't be calling this directly in a way where they could hit this. But we might accidentally trip the invariant as part of library development (i.e. adding a new primitive), so -> fail fast.

Contributor

I guess put differently, is there any disadvantage to making get_lto_ir() an implementation detail, and simply exposing these (cached) properties? The "user" here is us, i.e., library developers.

Either way, I'm going ahead and approving. Leaving it to your best judgement here!

Contributor Author

Yeah I actually had it as a cached property like that last time, but switched to using the underscore + property here as we'll be adding more of these in single-phase.
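
For readers following the thread, the two designs being discussed can be sketched roughly as below. This is only an illustration, not the code from this PR: the class names `FailFastAlgorithm` and `LazyAlgorithm`, the `_compile` helper, and the size and alignment values are all hypothetical. Only the `get_lto_ir()` name, the `_temp_storage_alignment` attribute, and the error message come from the diff excerpt above.

```python
from functools import cached_property


class FailFastAlgorithm:
    """Hypothetical primitive: underscore attribute + property that fails fast."""

    def __init__(self):
        # Filled in as a side effect of get_lto_ir().
        self._temp_storage_bytes = None
        self._temp_storage_alignment = None

    def get_lto_ir(self):
        # Placeholder for the real compilation step; the values are made up.
        self._temp_storage_bytes = 1024
        self._temp_storage_alignment = 16
        return b"<lto-ir>"

    @property
    def temp_storage_alignment(self):
        # Internal invariant: a library developer who reads this too early
        # gets an immediate, explicit error instead of a silent None.
        if self._temp_storage_alignment is None:
            raise RuntimeError(
                "Temporary storage alignment not computed yet. "
                "Call get_lto_ir() first."
            )
        return self._temp_storage_alignment


class LazyAlgorithm:
    """Hypothetical alternative: compile on first access and cache the result."""

    def __init__(self):
        self.compile_count = 0

    def _compile(self):
        # Stand-in for the compilation step, counted so caching is observable.
        self.compile_count += 1
        return {"bytes": 1024, "alignment": 16}

    @cached_property
    def _metadata(self):
        return self._compile()

    @property
    def temp_storage_alignment(self):
        return self._metadata["alignment"]
```

In the first class, reading `temp_storage_alignment` before `get_lto_ir()` raises `RuntimeError`, which matches the "fail fast" intent described in the reply. In the second, the compile step runs implicitly on first access and only once, which is closer to what the reviewer suggested.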

@tpn tpn merged commit 7e8af72 into NVIDIA:main Sep 3, 2025
42 checks passed
@github-project-automation github-project-automation Bot moved this from In Review to Done in CCCL Sep 3, 2025
davebayer pushed a commit to davebayer/cccl that referenced this pull request Sep 23, 2025
…onversion. (NVIDIA#5355)


3 participants