Obtain temp storage size and alignment directly from LTO IR via PTX conversion.#5355
Conversation
🟨 CI finished in 37m 29s: Pass: 77%/22 | Total: 2h 58m | Avg: 8m 07s | Max: 21m 09s
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| CCCL Packaging | |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| CCCL Packaging | |
| libcu++ | |
| CUB | |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| CCCL C Parallel Library | |
| Catch2Helper |
🏃 Runner counts (total jobs: 22)
| # | Runner |
|---|---|
| 16 | linux-amd64-gpu-l4-latest-1 |
| 4 | linux-amd64-gpu-h100-latest-1 |
| 2 | linux-amd64-cpu16 |
|
@tpn I don't think this PR depends on NVIDIA/numba-cuda#326. You are directly using Numba's linker binding to get the ptx from ltoir, which was introduced in the cuda-python linker work in NVIDIA/numba-cuda#133, this was cut in 0.16.0. |
|
Ah looks like the issue is lack of Which... looks like it was added 9 months ago. Weird. |
31fe5f6 to
10704ef
Compare
|
Turns out we weren't even using numba-cuda in CI! Just old-school |
🟨 CI finished in 1h 22m: Pass: 99%/161 | Total: 1d 09h | Avg: 12m 28s | Max: 1h 20m | Hits: 97%/152388
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| CCCL Packaging | |
| libcu++ | |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | CCCL Packaging |
| libcu++ | |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 161)
| # | Runner |
|---|---|
| 93 | linux-amd64-cpu16 |
| 17 | windows-amd64-cpu16 |
| 16 | linux-amd64-gpu-l4-latest-1 |
| 10 | linux-arm64-cpu16 |
| 9 | linux-amd64-gpu-h100-latest-1 |
| 7 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
ef68243 to
ebfe17c
Compare
🟩 CI finished in 1h 07m: Pass: 100%/162 | Total: 1d 03h | Avg: 10m 05s | Max: 34m 00s | Hits: 99%/152553
|
| Project | |
|---|---|
| CCCL Infrastructure | |
| CCCL Packaging | |
| libcu++ | |
| +/- | CUB |
| Thrust | |
| CUDA Experimental | |
| stdpar | |
| +/- | python |
| CCCL C Parallel Library | |
| Catch2Helper |
Modifications in project or dependencies?
| Project | |
|---|---|
| CCCL Infrastructure | |
| +/- | CCCL Packaging |
| libcu++ | |
| +/- | CUB |
| +/- | Thrust |
| +/- | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 162)
| # | Runner |
|---|---|
| 93 | linux-amd64-cpu16 |
| 17 | linux-amd64-gpu-l4-latest-1 |
| 17 | windows-amd64-cpu16 |
| 10 | linux-arm64-cpu16 |
| 9 | linux-amd64-gpu-h100-latest-1 |
| 7 | linux-amd64-gpu-rtx2080-latest-1 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| requires-python = ">=3.9" | ||
| dependencies = [ | ||
| "numba>=0.60.0", | ||
| "numba-cuda>=0.16.0", |
There was a problem hiding this comment.
Investigate: do we still need numba here? Can we just depend on numba-cuda bringing in the right version?
There was a problem hiding this comment.
Maybe let's add a note so it doesn't keep coming up: #5613 (comment)
…onversion. This obviates the need for a much more expensive separate PTX compilation step just to get the same information. This reduces the overhead of a given primitive call by over half. (`test_block_exchange.py` went from 1m 23s to about 33s with this change in place, for example.) N.B. Depends on a very recent numba-cuda change by @isVoid: NVIDIA/numba-cuda#326. I don't think a branch has been cut with this change yet, so... we'll need to wait for that before we can pin an appropriate version and test this in CI.
ebfe17c to
bfaeb24
Compare
🥳 CI Workflow Results🟩 Finished in 2h 39m: Pass: 100%/22 | Total: 3h 42m | Max: 21m 55sSee results here. |
| if self._temp_storage_alignment is None: | ||
| raise RuntimeError( | ||
| "Temporary storage alignment not computed yet. " | ||
| "Call get_lto_ir() first." |
There was a problem hiding this comment.
Question: why not call get_lto_ir() on the user's behalf instead of raising here?
There was a problem hiding this comment.
It's essentially an internal invariant... users wouldn't be calling this directly in a way where they could hit this. But we might accidentally trip the invariant as part of library development (i.e. adding a new primitive), so -> fail fast.
There was a problem hiding this comment.
I guess put differently, is there any disadvantage to making get_lto_ir() an implementation detail, and simply exposing these (cached) properties? The "user" here is us, i.e., library developers.
Either ways, I'm going ahead and approving. Leaving it to your best judgement here!
There was a problem hiding this comment.
Yeah I actually had it as a cached property like that last time, but switched to using the underscore + property here as we'll be adding more of these in single-phase.
…onversion. (NVIDIA#5355) This obviates the need for a much more expensive separate PTX compilation step just to get the same information. This reduces the overhead of a given primitive call by over half. (`test_block_exchange.py` went from 1m 23s to about 33s with this change in place, for example.)
This obviates the need for a much more expensive separate PTX compilation step just to get the same information. This reduces the overhead of a given primitive call by over half. (
test_block_exchange.pywent from 1m 23s to about 33s with this change in place, for example.)