Integrate with CUDA.jl + reduce launch overhead #214

Merged
maleadt merged 15 commits into main from tb/cuda_integration on May 1, 2026

@maleadt maleadt commented Apr 30, 2026

julia> @benchmark @cuda identity(nothing)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.495 μs …  5.160 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.554 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.559 μs ± 52.910 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                   ▁▃█▄▃▁
  ▁▁▁▁▁▁▁▁▂▂▂▃▃▃▄▅▇██████▇▅▃▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.49 μs        Histogram: frequency by time        1.67 μs <

 Memory estimate: 112 bytes, allocs estimate: 3.

julia> @benchmark @cuda backend=cuTile identity(nothing)
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.624 μs …  10.826 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.668 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.685 μs ± 192.335 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃▅█▁▂ ▁▁
  ▁▁▂▂▃▅████████▆██▇▄▃▃▂▃▂▃▄▃▃▂▃▂▂▂▁▂▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.62 μs         Histogram: frequency by time        1.84 μs <

 Memory estimate: 272 bytes, allocs estimate: 8.

🚀

maleadt and others added 15 commits April 30, 2026 14:09
The package extension was effectively a fiction — every functional path
(launch, broadcast, RNG, examples, all host/device tests) requires
CUDACore. Codegen-only paths still work, but they live behind
CUDACore-aware code anyway.

Moves the ext contents into:
- src/launch.jl       — launch infrastructure
- src/broadcast.jl    — TiledStyle vs CuArrayStyle override
- src/language/random.jl — host-side rand!/randn!/randexp! on CuArray

Removes the `launch(args...) = error("Please import CUDA.jl …")` stub.
Replaces the bespoke `to_tile_arg` flat dispatch with a `KernelAdaptor`
+ `cuTileconvert` mirroring `CUDACore.cudaconvert`. User-defined struct
types containing arrays now compose naturally — register an
`Adapt.adapt_structure` method and they're recursively converted before
launch.

Adapt's default `adapt_structure(to, ::PermutedDimsArray)` rebuilds the
wrapper around the adapted parent, which doesn't fit `TileArray`'s
absorb-strides-into-the-struct design. Added an explicit override so the
whole wrapper collapses to a single TileArray.

Wires cuTile into CUDACore's `@cuda backend=…` dispatch protocol and
introduces a hoistable `cufunction(f, tt)` step that mirrors
`CUDACore.cufunction`. Three user-facing entry points now exist:

  cuTile.launch(f, grid, args…; opts…)              # function form
  @cuda backend=cuTile.TileBackend() blocks=N f(…)  # macro form
  k = cuTile.cufunction(f, tt; opts…); k(args…; …)  # pre-compiled

The pre-compiled form skips MI lookup and CompilerCaching dispatch on
each call — useful for tight launch loops.

Caching of the `TileKernel` wrapper rides on CompilerCaching: a new
`tile_kernel` field on `CuTileResults` stores it alongside the cubin and
CuFunction, so kernel-instance lifecycle follows the underlying
`CodeInstance` instead of a separate global Dict.

Other changes:
- `check_tile_ir_support()` is memoized per `(capability, cuda_version)`
  instead of running every launch.
- `kernel_convert(::TileBackend, x) = cuTileconvert(x)` and
  `kernel_compile(::TileBackend, f, tt; …) = cufunction(f, tt; …)`
  register cuTile as a backend with CUDACore.
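
The memoization of `check_tile_ir_support()` mentioned in the list above could look roughly like the following. This is a sketch under assumptions, not the PR's code: the cache, the counter, the helper names, and the version thresholds are all hypothetical, and the `Dict` is not thread-safe as written. The point is only that `get!` runs the expensive check once per `(capability, cuda_version)` key.

```julia
# Hypothetical cache keyed by (capability, cuda_version).
const SUPPORT_CACHE = Dict{Tuple{VersionNumber,VersionNumber},Bool}()
const CHECKS_RUN = Ref(0)  # counts how often the slow path actually runs

# Stand-in for the expensive check; the thresholds are illustrative only.
function slow_support_check(capability::VersionNumber, cuda_version::VersionNumber)
    CHECKS_RUN[] += 1
    return capability >= v"8.0" && cuda_version >= v"13.0"
end

# get! only calls the do-block the first time a key is looked up.
function check_support(capability::VersionNumber, cuda_version::VersionNumber)
    return get!(SUPPORT_CACHE, (capability, cuda_version)) do
        slow_support_check(capability, cuda_version)
    end
end

check_support(v"9.0", v"13.1")  # runs the slow check
check_support(v"9.0", v"13.1")  # served from the cache
```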

Each emit phase used to do its own `get(cache, mi)` + `results(cache, ci, ...)`,
so a single launch resolved the same `(ci, res)` pair four times. With
CompilerCaching's new `lookup` accessor and an `ensure_compiled` helper,
`cufunction` resolves them once at the top and threads them down through
`emit_function!` → `emit_binary!` → `emit_tile!` → `emit_structured!`.

The cache short-circuit in each phase moves to *after* the recurse so
`compile_hook` (used by `@device_code_*` reflection) still fires on every
launch when downstream artifacts are fully cached.

Shaves ~1.4 µs and 9 allocations off the function-form launch overhead on a
vadd kernel (9.6 µs → 8.2 µs).

…rated`.

The original three passes over `tt.parameters` (`any` for `has_consts`, `map`
for the unwrapped tuple, then a `for ... push!` for the const-argtypes vector)
cost ~1 µs/launch on a vadd kernel. With `tt` `@nospecialize`d, none of it
const-folded.

Drop the `@nospecialize(tt)` so `cufunction` specializes per kernel signature,
and replace the body with a single `@generated` `unwrap_argtypes` that:
- emits the `Tuple{...}` argtypes type at compile time (no allocation),
- emits the `Any[...]` const-argtypes literal so only the runtime CC.Const
  boxes for argument values survive,
- short-circuits to `nothing` when `tt` has no `Constant` slots, skipping
  the const-prop pipeline entirely.
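
The type-level half of that `@generated` approach can be sketched as follows. `Const` is a hypothetical marker type standing in for cuTile's `Constant`, and this toy `unwrap_argtypes` only covers the `Tuple{...}` emission and the `nothing` short-circuit; the `Any[...]` const-argtypes literal from the real implementation is omitted.

```julia
# Hypothetical marker: Const{T} wraps an argument of type T.
struct Const{T}
    value::T
end

# The generator runs once per concrete signature TT. The loop below executes at
# compile time; only the returned expression remains in the compiled method.
@generated function unwrap_argtypes(::Type{TT}) where {TT<:Tuple}
    unwrapped = Any[]
    has_const = false
    for P in TT.parameters
        if P <: Const
            has_const = true
            push!(unwrapped, P.parameters[1])  # Const{T} contributes T
        else
            push!(unwrapped, P)
        end
    end
    # No Const slots: the method body is just the constant `nothing`.
    has_const || return :(nothing)
    # Otherwise splice the unwrapped parameter types into a Tuple type.
    return :(Tuple{$(unwrapped...)})
end

with_const = unwrap_argtypes(Tuple{Const{Int},Float64})
without_const = unwrap_argtypes(Tuple{Int,Float64})
```

Because the generator is keyed on `TT`, this only pays off when the caller is specialized on the signature, which is why the commit drops `@nospecialize(tt)`.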

`cufunction` time on a vadd kernel: 1.85 µs → 1.14 µs → ~700 ns. The total launch
gap (function-form vs pre-compiled) narrowed from 4.4 µs to 1.6 µs.

- `DefaultBackend() = TileBackend()` so `@cuda backend=cuTile …`
  resolves through CUDACore's module-as-backend hook.
- Default `blocks=1` on `(::TileKernel)(args...; …)` to match
  CUDACore's driver-level `launch`/`cudacall`, so a no-grid
  `@cuda backend=cuTile f(args...)` works.
- `public` list now includes `TileBackend` and `DefaultBackend`;
  doc/example strings lead with the module-form invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`Base.rand(Random.RandomDevice(), UInt32)` costs ~110 ns; the task-local
`Base.rand(UInt32)` is ~1.5 ns and produces equally distinct seeds across
launches. Mirrors what `HostKernel.make_seed` does on the LLVM backend.
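
For reference, the two seed sources being compared are shown side by side below. The function names are hypothetical; the `rand` calls are the ones quoted in the commit message. The timings above come from the commit, not from this sketch, and could be re-measured with BenchmarkTools' `@btime` on each function.

```julia
using Random

# Draws from the operating system's entropy source on every call.
seed_from_device() = rand(Random.RandomDevice(), UInt32)

# Draws from the task-local default RNG, with no system call involved.
seed_from_task() = rand(UInt32)

s1 = seed_from_device()
s2 = seed_from_task()
```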

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous body built `flat_args` via `Iterators.flatten(map(flatten, args))...`
and `flat_types` via `Tuple{map(typeof, flat_args)...}`. Inference widens both
to abstract tuple/type lattice elements and lowers to `Core._apply_iterate`,
costing ~400 ns of splat overhead per launch. Mirrors the `@generated` callable
on CUDACore's `AbstractKernel`, which folds the same work at compile time.

`_flatten_static!` recurses over the arg types: TileArray expands to
`(ptr, sizes..., strides...)`, ghost types contribute nothing, primitives pass
through, structs recurse field-by-field via `getfield(_, i)`.
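
The recursion described above can be approximated with the sketch below. `ToyTile`, `flatten_exprs!`, `flatten_args`, `Marker`, and `Params` are hypothetical names made up for this example; the real `TileArray` layout and `_flatten_static!` may differ. The helper walks the argument types at compile time and collects one accessor expression per leaf value, so the generated method body is a single `tuple(...)` call with no runtime splatting.

```julia
# Hypothetical stand-in for TileArray: a pointer plus per-dimension sizes and strides.
struct ToyTile{T,N}
    ptr::Ptr{T}
    sizes::NTuple{N,Int32}
    strides::NTuple{N,Int32}
end

# Collect, for type T reached via expression ex, the expressions reading each leaf.
function flatten_exprs!(out, T, ex)
    if T <: ToyTile
        N = T.parameters[2]
        push!(out, :(getfield($ex, :ptr)))
        for i in 1:N
            push!(out, :(getfield($ex, :sizes)[$i]))
        end
        for i in 1:N
            push!(out, :(getfield($ex, :strides)[$i]))
        end
    elseif isprimitivetype(T)
        push!(out, ex)                      # primitives pass through
    elseif fieldcount(T) == 0
        nothing                             # ghost types contribute nothing
    else
        for i in 1:fieldcount(T)            # structs recurse field by field
            flatten_exprs!(out, fieldtype(T, i), :(getfield($ex, $i)))
        end
    end
    return out
end

# Inside the generator, `args` is the tuple of argument types.
@generated function flatten_args(args...)
    out = Any[]
    for (i, T) in enumerate(args)
        flatten_exprs!(out, T, :(args[$i]))
    end
    return :(tuple($(out...)))
end

struct Marker end
struct Params
    alpha::Float32
    n::Int32
end

tile = ToyTile{Float32,2}(Ptr{Float32}(0), (Int32(4), Int32(8)), (Int32(1), Int32(4)))
flat = flatten_args(tile, Marker(), Params(2f0, Int32(3)), 7)
```

Here `flat` holds eight values: the pointer, two sizes, two strides, the two `Params` fields, and the trailing integer; `Marker()` disappears.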

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous owner `(:cuTile, opts::NamedTuple)` carried two `VersionNumber`
fields, whose non-isbits prerelease/build tuples forced the per-`lookup` heap
box up to ~80 B / ~17 ns. Pack `sm_arch` and `bytecode_version` into `UInt16`
and use `-1` as the `nothing` sentinel for hint fields, so the owner is
isbits — box drops to ~32 B / ~6 ns. Decoding back to the original
`VersionNumber` / `Union{Int, Nothing}` types only happens once per cache miss
in `emit_binary!` and `emit_tile!`, never on the hot lookup path.
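
A small sketch of the isbits packing idea follows. The helper names and the bit layout (major version in the high byte, minor in the low byte) are assumptions for illustration; the commit only states that the versions are packed into `UInt16` and that `-1` stands in for `nothing`.

```julia
# Assumed layout: major in the high byte, minor in the low byte.
pack_version(v::VersionNumber) = (UInt16(v.major) << 8) | UInt16(v.minor)
unpack_version(p::UInt16) = VersionNumber(p >> 8, p & 0x00ff)

# -1 is the sentinel for `nothing`, as in the commit message.
pack_hint(h::Union{Int,Nothing}) = h === nothing ? -1 : h
unpack_hint(h::Int) = h == -1 ? nothing : h

# Owner options before packing: VersionNumber fields make the tuple non-isbits.
loose = (sm_arch = v"9.0", bytecode_version = v"13.1", hint = nothing)

# Owner options after packing: only UInt16 and Int fields remain.
packed = (sm_arch = pack_version(loose.sm_arch),
          bytecode_version = pack_version(loose.bytecode_version),
          hint = pack_hint(loose.hint))

loose_is_bits = isbits(loose)
packed_is_bits = isbits(packed)
```

Decoding with `unpack_version` and `unpack_hint` recovers the original values, so the conversion can be deferred to the cache-miss path as the commit describes.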

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
maleadt merged commit 9dfad0b into main on May 1, 2026
1 check passed
maleadt deleted the tb/cuda_integration branch on May 1, 2026 06:52