Integrate with CUDA.jl + reduce launch overhead by maleadt · Pull Request #214 · JuliaGPU/cuTile.jl

maleadt · 2026-04-30T13:48:02Z

julia> @benchmark @cuda identity(nothing)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.495 μs …  5.160 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.554 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.559 μs ± 52.910 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▁▃█▄▃▁
  ▁▁▁▁▁▁▁▁▂▂▂▃▃▃▄▅▇██████▇▅▃▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.49 μs        Histogram: frequency by time        1.67 μs <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> @benchmark @cuda backend=cuTile identity(nothing)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.624 μs …  10.826 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.668 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.685 μs ± 192.335 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃▅█▁▂ ▁▁
  ▁▁▂▂▃▅████████▆██▇▄▃▃▂▃▂▃▄▃▃▂▃▂▂▂▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.62 μs         Histogram: frequency by time        1.84 μs <

 Memory estimate: 272 bytes, allocs estimate: 8.

🚀

The package extension was effectively a fiction: every functional path (launch, broadcast, RNG, examples, all host/device tests) requires CUDACore. Codegen-only paths still work, but they live behind CUDACore-aware code anyway.

Moves the ext contents into:

- `src/launch.jl`: launch infrastructure
- `src/broadcast.jl`: `TiledStyle` vs `CuArrayStyle` override
- `src/language/random.jl`: host-side `rand!`/`randn!`/`randexp!` on `CuArray`

Removes the `launch(args...) = error("Please import CUDA.jl …")` stub.

Replaces the bespoke `to_tile_arg` flat dispatch with a `KernelAdaptor` + `cuTileconvert` mirroring `CUDACore.cudaconvert`. User-defined struct types containing arrays now compose naturally — register an `Adapt.adapt_structure` method and they're recursively converted before launch. Adapt's default `adapt_structure(to, ::PermutedDimsArray)` rebuilds the wrapper around the adapted parent, which doesn't fit `TileArray`'s absorb-strides-into-the-struct design. Added an explicit override so the whole wrapper collapses to a single TileArray.
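As a rough illustration of the Adapt.jl pattern described above, the sketch below shows how a single `adapt_structure` method lets a user-defined struct be converted recursively. `ToyAdaptor`, `ToyTile`, and `Model` are made-up names for this example. They are not cuTile's `KernelAdaptor`, `cuTileconvert`, or `TileArray`, whose actual definitions are not reproduced here.

```julia
using Adapt

# Hypothetical stand-ins; not cuTile's KernelAdaptor or TileArray.
struct ToyAdaptor end
struct ToyTile{T}
    data::Vector{T}
end

# Leaf rule: how this toy adaptor converts a plain host array.
Adapt.adapt_storage(::ToyAdaptor, x::Vector) = ToyTile(x)

# A user-defined struct that contains arrays.
struct Model{W,B}
    weights::W
    bias::B
end

# Structural rule: rebuild the struct around its adapted fields.
Adapt.adapt_structure(to, m::Model) = Model(adapt(to, m.weights), adapt(to, m.bias))

m = Model([1.0, 2.0], [0.5])
converted = adapt(ToyAdaptor(), m)   # both fields are now ToyTile values
```

Adapt.jl also provides the `Adapt.@adapt_structure Model` macro, which generates an equivalent method for simple structs.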

@cuda

Wires cuTile into CUDACore's `@cuda backend=…` dispatch protocol and introduces a hoistable `cufunction(f, tt)` step that mirrors `CUDACore.cufunction`. Three user-facing entry points now exist:

    cuTile.launch(f, grid, args…; opts…)                 # function form
    @cuda backend=cuTile.TileBackend() blocks=N f(…)     # macro form
    k = cuTile.cufunction(f, tt; opts…); k(args…; …)     # pre-compiled

The pre-compiled form skips MI lookup and CompilerCaching dispatch on each call, which is useful for tight launch loops.

Caching of the `TileKernel` wrapper rides on CompilerCaching: a new `tile_kernel` field on `CuTileResults` stores it alongside the cubin and CuFunction, so kernel-instance lifecycle follows the underlying `CodeInstance` instead of a separate global Dict.

Other changes:

- `check_tile_ir_support()` is memoized per `(capability, cuda_version)` instead of running every launch.
- `kernel_convert(::TileBackend, x) = cuTileconvert(x)` and `kernel_compile(::TileBackend, f, tt; …) = cufunction(f, tt; …)` register cuTile as a backend with CUDACore.

Each emit phase used to do its own `get(cache, mi)` + `results(cache, ci, ...)`, so a single launch resolved the same `(ci, res)` pair four times. With CompilerCaching's new `lookup` accessor and an `ensure_compiled` helper, `cufunction` resolves them once at the top and threads them down through `emit_function!` → `emit_binary!` → `emit_tile!` → `emit_structured!`.

The cache short-circuit in each phase moves to *after* the recursive call, so `compile_hook` (used by `@device_code_*` reflection) still fires on every launch even when downstream artifacts are fully cached.

Shaves ~1.4 µs / 9 allocations off the function-form launch overhead on a vadd kernel (9.6 µs → 8.2 µs).

…rated`.

The original three passes over `tt.parameters` (`any` for `has_consts`, `map` for the unwrapped tuple, then a `for ... push!` for the const-argtypes vector) cost ~1 µs/launch on a vadd kernel. With `tt` `@nospecialize`d, none of it const-folded.

Drop the `@nospecialize(tt)` so `cufunction` specializes per kernel signature, and replace the body with a single `@generated` `unwrap_argtypes` that:

- emits the `Tuple{...}` argtypes type at compile time (no allocation),
- emits the `Any[...]` const-argtypes literal so only the runtime CC.Const boxes for argument values survive,
- short-circuits to `nothing` when `tt` has no `Constant` slots, skipping the const-prop pipeline entirely.

cufunction time on a vadd kernel: 1.85 µs → 1.14 µs → ~700 ns. Total launch gap (function-form vs pre-compiled) closed from 4.4 µs to 1.6 µs.
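The general shape of such a `@generated` unwrapper can be sketched as follows. This is not the PR's code: the `Constant{T}` wrapper here is a hypothetical stand-in whose layout is an assumption, and the sketch covers only the first and third bullets (emitting the `Tuple{...}` type and returning `nothing` when no `Constant` slot is present). It leaves out the const-argtypes literal.

```julia
# Hypothetical stand-in for a compile-time-constant argument wrapper.
struct Constant{T}
    value::T
end

# The loop below runs once per distinct signature `TT`, at compile time.
# Each call then just returns the precomputed type (or `nothing`).
@generated function unwrap_argtypes(::Type{TT}) where {TT<:Tuple}
    unwrapped = Any[]
    has_consts = false
    for p in TT.parameters
        if p isa DataType && p <: Constant
            has_consts = true
            push!(unwrapped, p.parameters[1])   # Constant{T} unwraps to T
        else
            push!(unwrapped, p)
        end
    end
    # No Constant slots: signal that the const-handling path can be skipped.
    has_consts || return :(nothing)
    return :(Tuple{$(unwrapped...)})
end

with_const    = unwrap_argtypes(Tuple{Int, Constant{Float64}})
without_const = unwrap_argtypes(Tuple{Int, Float32})
```

Because the method specializes on `TT`, the single pass over `TT.parameters` happens during code generation rather than on every launch, which is the effect the commit message attributes to dropping `@nospecialize(tt)`.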

- `DefaultBackend() = TileBackend()` so `@cuda backend=cuTile …` resolves through CUDACore's module-as-backend hook.
- Default `blocks=1` on `(::TileKernel)(args...; …)` to match CUDACore's driver-level `launch`/`cudacall`, so a no-grid `@cuda backend=cuTile f(args...)` works.
- `public` list now includes `TileBackend` and `DefaultBackend`; doc/example strings lead with the module-form invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`Base.rand(Random.RandomDevice(), UInt32)` costs ~110 ns; the task-local `Base.rand(UInt32)` is ~1.5 ns and produces equally distinct seeds across launches. Mirrors what `HostKernel.make_seed` does on the LLVM backend.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
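The two seed sources compared above can be written side by side as below. The function names `seed_from_device` and `seed_from_task` are invented for this sketch, and the timings quoted in the commit message are not reproduced or verified here.

```julia
using Random

# Slower path: draws from the OS entropy source on every call.
seed_from_device() = rand(Random.RandomDevice(), UInt32)

# Faster path: draws from the task-local default RNG.
seed_from_task() = rand(UInt32)

s1 = seed_from_task()
s2 = seed_from_device()
```

Both return a `UInt32`, so switching between them does not change the type of the seed passed to a kernel.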

The previous body built `flat_args` via `Iterators.flatten(map(flatten, args))...` and `flat_types` via `Tuple{map(typeof, flat_args)...}`. Inference widens both to abstract tuple/type lattice elements and lowers to `Core._apply_iterate`, costing ~400 ns of splat overhead per launch.

The replacement mirrors the `@generated` callable on CUDACore's `AbstractKernel`, which folds the same work at compile time. `_flatten_static!` recurses over the arg types: TileArray expands to `(ptr, sizes..., strides...)`, ghost types contribute nothing, primitives pass through, and structs recurse field-by-field via `getfield(_, i)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
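A minimal sketch of that kind of type-driven flattening is shown below. `FakeTileArray`, `Scaled`, `flatten_static!`, and `flatten_args` are hypothetical names for this illustration. They are not cuTile's `TileArray` or `_flatten_static!`, and the field layout is an assumption based only on the `(ptr, sizes..., strides...)` description above.

```julia
# Hypothetical stand-in for a tile array: pointer plus per-dimension metadata.
struct FakeTileArray{N}
    ptr::UInt
    sizes::NTuple{N,Int32}
    strides::NTuple{N,Int32}
end

# Append to `out` the expressions that extract the flat arguments of `ex`,
# where `ex` is an expression whose value has concrete type `T`.
function flatten_static!(out::Vector{Any}, @nospecialize(T), ex)
    if T <: FakeTileArray
        N = T.parameters[1]
        push!(out, :(getfield($ex, :ptr)))
        for i in 1:N
            push!(out, :(getfield($ex, :sizes)[$i]))
        end
        for i in 1:N
            push!(out, :(getfield($ex, :strides)[$i]))
        end
    elseif isprimitivetype(T)
        push!(out, ex)                       # primitives pass through
    elseif fieldcount(T) == 0
        # ghost type: contributes no launch argument
    else
        for i in 1:fieldcount(T)             # structs recurse field-by-field
            flatten_static!(out, fieldtype(T, i), :(getfield($ex, $i)))
        end
    end
    return out
end

# Inside the generator, `args` is a tuple of argument *types*.
@generated function flatten_args(args...)
    out = Any[]
    for i in 1:length(args)
        flatten_static!(out, args[i], :(args[$i]))
    end
    return :(tuple($(out...)))
end

struct Scaled
    arr::FakeTileArray{2}
    scale::Float32
    tag::Nothing
end

a = FakeTileArray{2}(UInt(0x1000), (Int32(4), Int32(8)), (Int32(1), Int32(4)))
flat = flatten_args(Scaled(a, 2.5f0, nothing), Int32(7))
```

The generated body is a single `tuple(...)` call over fixed field accesses, so the result type is concrete and no runtime splatting is involved.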

The previous owner `(:cuTile, opts::NamedTuple)` carried two `VersionNumber` fields, whose non-isbits prerelease/build tuples forced the per-`lookup` heap box up to ~80 B / ~17 ns. Pack `sm_arch` and `bytecode_version` into `UInt16` and use `-1` as the `nothing` sentinel for hint fields, so the owner is isbits; the box drops to ~32 B / ~6 ns. Decoding back to the original `VersionNumber` / `Union{Int, Nothing}` types only happens once per cache miss in `emit_binary!` and `emit_tile!`, never on the hot lookup path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
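The packing idea can be sketched as follows. The bit layout (major version in the high byte, minor in the low byte), the helper names, the `num_ctas` hint field, and the version values are all assumptions made for illustration; the PR's actual encoding is not shown in the text above.

```julia
# Assumed encoding: major in the high byte, minor in the low byte.
pack_version(v::VersionNumber) = (UInt16(v.major) << 8) | UInt16(v.minor)
unpack_version(p::UInt16) = VersionNumber(Int(p >> 8), Int(p & 0x00ff))

# -1 stands in for `nothing` so a hint field stays a plain Int.
encode_hint(h::Union{Int,Nothing}) = h === nothing ? -1 : h
decode_hint(h::Int) = h == -1 ? nothing : h

# Options as a user might write them (hypothetical field names and values).
boxed = (sm_arch = v"9.0", bytecode_version = v"13.1", num_ctas = nothing)

# The same information in a form that contains only bits types.
packed = (sm_arch = pack_version(boxed.sm_arch),
          bytecode_version = pack_version(boxed.bytecode_version),
          num_ctas = encode_hint(boxed.num_ctas))

isbits(boxed)    # false: VersionNumber carries non-isbits tuple fields
isbits(packed)   # true: UInt16, UInt16, Int
```

Decoding with `unpack_version` and `decode_hint` recovers the original values, which matches the described design of paying the conversion cost only on a cache miss.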

maleadt and others added 15 commits April 30, 2026 14:09

Handle nospecialize'd kernels.

3ff3584

Optimize check_tile_ir_support.

0cd9109

Migrate all kernel launches.

82dd6fc

Bump dependencies.

6802600

Remove GH:A CI.

f1e2555

Simplify CI.

6581cea

maleadt merged commit 9dfad0b into main May 1, 2026
1 check passed

maleadt deleted the tb/cuda_integration branch May 1, 2026 06:52

maleadt merged 15 commits into main from tb/cuda_integration
