
Switch to CompilerCaching.jl #777

Draft
maleadt wants to merge 15 commits into main from tb/compilercaching

maleadt commented May 12, 2026


maleadt and others added 15 commits July 3, 2026 07:10
Replace `cached_compilation` with a `MetalResults` struct attached to
each `CodeInstance` via `CompilerCaching`: `metallib` + entry name are
session-portable (cached through precompilation), and the
`MTLComputePipelineState` is materialized lazily per session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
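
A rough sketch of the shape described above, written without any dependency on Metal.jl. `SketchResults`, `LazyPipeline` and `pipeline!` are hypothetical stand-ins (not the definitions in this PR); only the split between session-portable fields and a lazily built session-local handle is taken from the commit message.

```julia
# Hypothetical stand-in for MTLComputePipelineState. Mutable so that `===`
# compares object identity, like a real per-session handle.
mutable struct LazyPipeline
    entry::String
end

# Sketch of a per-CodeInstance results record.
mutable struct SketchResults
    metallib::Union{Nothing,Vector{UInt8}}   # session-portable bytes
    entry::Union{Nothing,String}             # session-portable entry-point name
    pipeline::Union{Nothing,LazyPipeline}    # session-local, built on first use
end

# Materialize the pipeline lazily: link once per session, then reuse it.
function pipeline!(r::SketchResults)
    r.metallib === nothing && error("kernel has not been compiled yet")
    if r.pipeline === nothing
        r.pipeline = LazyPipeline(r.entry)   # stands in for the real link step
    end
    return r.pipeline
end

r = SketchResults(UInt8[0x01, 0x02], "kernel_main", nothing)
p1 = pipeline!(r)   # first call links
p2 = pipeline!(r)   # second call reuses the same handle
```

In this sketch only `metallib` and `entry` would be worth serializing; the `pipeline` slot starts out as `nothing` in every new session.
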
Adds a `bitcode` field to `MetalResults` and overrides
`GPUCompiler.bitcode` / `bitcode!`. Per-function runtime library bitcode
now rides on the same precompilation path as `metallib`/`entry`, so
cross-session loads can skip the runtime rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single `pipeline` slot on `MetalResults` with a small linear
cache of `(MTLDevice, MTLComputePipelineState)` pairs. The cache partition
already covers the macOS / AIR / Metal versions that affect codegen, but
two `MTLDevice`s on a single Mac (e.g. integrated + discrete) share the
same `metallib` and need separate `MTLComputePipelineState`s.

Hot-path cost is unchanged: one field load + one `===` compare. The
common case (single device) stays at n=1.

`link_pipeline` now takes the target `MTLDevice` explicitly instead of
calling `device()` internally, so the call site captures the device once
under `mtlfunction_lock`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
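
For illustration, a minimal sketch of such a linear per-device cache. `SketchDevice`, `SketchPipeline`, `link_sketch` and `pipeline_for!` are hypothetical names, and the locking done under `mtlfunction_lock` is omitted here.

```julia
# Hypothetical stand-ins for MTLDevice and MTLComputePipelineState.
mutable struct SketchDevice
    name::String
end
mutable struct SketchPipeline
    device::SketchDevice
end

const LINK_COUNT = Ref(0)   # counts how often the (fake) link step runs

# Stands in for a link step that receives the target device explicitly.
function link_sketch(dev::SketchDevice)
    LINK_COUNT[] += 1
    return SketchPipeline(dev)
end

# Linear scan over (device, pipeline) pairs. With a single device the loop
# body runs once, so a hit costs one identity comparison.
function pipeline_for!(cache::Vector{Tuple{SketchDevice,SketchPipeline}},
                       dev::SketchDevice)
    for (d, p) in cache
        d === dev && return p
    end
    p = link_sketch(dev)
    push!(cache, (dev, p))
    return p
end

cache = Tuple{SketchDevice,SketchPipeline}[]
igpu, dgpu = SketchDevice("integrated"), SketchDevice("discrete")
a = pipeline_for!(cache, igpu)   # miss: links for the integrated GPU
b = pipeline_for!(cache, dgpu)   # miss: same metallib, different device
c = pipeline_for!(cache, igpu)   # hit: reuses `a`
```

A `Dict` would also work, but for one or two devices a linear scan avoids hashing on the launch path.
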
`something(lookup(...), compile_metal!(...))` evaluated `compile_metal!`
eagerly even on a cache hit, so every kernel launch silently re-ran the
full LLVM compile pipeline. Branch explicitly on the lookup result.

Warm-cache `mtlfunction` cost: ~3.4 ms → ~380 ns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
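
The pitfall can be reproduced in plain Julia: `something` is an ordinary function, so both arguments are evaluated before it is called. The helper names below (`lookup_sketch`, `compile_sketch!`, `compile_or_lookup_sketch!`) are hypothetical and only mimic the shape of the real call site.

```julia
const COMPILE_CALLS = Ref(0)   # counts how often the expensive step runs

lookup_sketch(cache, key) = get(cache, key, nothing)

function compile_sketch!(cache, key)
    COMPILE_CALLS[] += 1              # stands in for the full compile pipeline
    cache[key] = "artifact for $key"  # assignment returns the stored value
end

cache = Dict("k" => "artifact for k")   # warm cache: "k" is already present

# Buggy pattern: the fallback argument is evaluated even though lookup hits.
eager = something(lookup_sketch(cache, "k"), compile_sketch!(cache, "k"))
calls_after_eager = COMPILE_CALLS[]

# Fixed pattern: branch explicitly on the lookup result.
function compile_or_lookup_sketch!(cache, key)
    hit = lookup_sketch(cache, key)
    hit !== nothing && return hit
    return compile_sketch!(cache, key)
end

lazy = compile_or_lookup_sketch!(cache, "k")
calls_after_branch = COMPILE_CALLS[]
```

On Julia 1.7 and later the `@something` macro short-circuits and would be another way to avoid the eager call; the explicit branch is what the commit message describes.
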
Adds `bitcode_opaque` / `bitcode_typed` fields to `MetalResults` and
overrides the new `GPUCompiler.bitcode` / `bitcode!` trait pair. Runtime
functions (`gpu_malloc`, `gpu_report_exception`, …) now persist their
post-irgen renamed LLVM bitcode on their own `CodeInstance` on 1.11+,
which carries through package precompilation. 1.10 keeps falling back
to the session-local `_runtime_libs` cache.

The two slots reflect that opaque-pointer and typed-pointer LLVM IR
aren't interchangeable. In practice modern LLVM always uses opaque
pointers; the typed slot exists for older Julia/LLVM combinations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches GPUCompiler's simplified `bitcode`/`bitcode!` trait pair: LLVM's
pointer mode is fixed across a precompile/load pair of the same Julia
version, so one slot per `CodeInstance` is enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-hit path in `compile_or_lookup` just needs to distinguish a CI
that completed compilation from one that was only inferred as a callee.
A direct `r.metallib !== nothing` check at the single call site is clearer
than routing through a one-line trait override.

Also switch to `GPUCompiler.cache_view(job)` instead of constructing the
`CacheView` manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ut!.

`MetalResults` now keeps every kernel pipeline stage (LLVM IR bytes, AIR, the
final metallib) instead of just the last one. The intermediates were already
being computed in `compile_to_metallib` and then discarded; storing them is
nearly free and lets reflection dump any phase post-hoc without re-running the
compile.

The single byte slot is named `llvm_ir` (not `bitcode`) because AIR is also
a form of bitcode; the field always holds the LLVM-stage output. For
runtime-library function CIs it's the final artifact; for kernel CIs it's
an intermediate before the AIR downgrade. The content is binary bitcode
(`write(io, mod)`), ~10× smaller than textual IR. Each `MetalResults`
is attached to a single CI, belonging to either a kernel MI or a
runtime-function MI, so the two roles never share a slot and one field
covers both cleanly.

`cache_get` / `cache_put!` overrides on `MetalCompilerJob` implement
GPUCompiler's new caching protocol, routing the `:llvm_ir` key through a
CompilerCaching lookup to the relevant CI's `MetalResults.llvm_ir`.

`compile_to_metallib` now returns a NamedTuple of all artifacts; the caller
(`compile_metal!` on 1.11+, `_legacy_link` on 1.10) stores them on the
appropriate `MetalResults`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
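
A sketch of the "return every stage as a NamedTuple" pattern, with toy byte transformations in place of the real LLVM and Metal toolchain steps. All `*_sketch` functions, `StagedResults` and `store!` are hypothetical; only the field names `llvm_ir`, `air`, `metallib` and `entry` come from the commit messages.

```julia
# Toy stage functions: each one just prefixes the previous stage's bytes.
emit_llvm_sketch(src::String) = Vector{UInt8}("BC:" * src)
downgrade_to_air_sketch(ir::Vector{UInt8}) = vcat(Vector{UInt8}("AIR:"), ir)
package_metallib_sketch(air::Vector{UInt8}) = vcat(Vector{UInt8}("MTLB:"), air)

# Return all intermediates instead of only the final artifact.
function compile_to_metallib_sketch(src::String; entry::String = "kernel_main")
    llvm_ir  = emit_llvm_sketch(src)
    air      = downgrade_to_air_sketch(llvm_ir)
    metallib = package_metallib_sketch(air)
    return (; llvm_ir, air, metallib, entry)
end

# Stand-in for the results record that the caller fills in.
mutable struct StagedResults
    llvm_ir::Union{Nothing,Vector{UInt8}}
    air::Union{Nothing,Vector{UInt8}}
    metallib::Union{Nothing,Vector{UInt8}}
    entry::Union{Nothing,String}
end
StagedResults() = StagedResults(nothing, nothing, nothing, nothing)

# The caller copies each artifact onto the record.
function store!(r::StagedResults, artifacts::NamedTuple)
    r.llvm_ir  = artifacts.llvm_ir
    r.air      = artifacts.air
    r.metallib = artifacts.metallib
    r.entry    = artifacts.entry
    return r
end

artifacts = compile_to_metallib_sketch("src")
r = store!(StagedResults(), artifacts)
```

Because every stage is kept, a reflection helper could later read `r.air` or `r.llvm_ir` directly instead of recompiling.
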
GPUCompiler 2.0 now mediates all interaction with the compilation cache
through a single version-agnostic entry point, `cached_results`, so Metal no
longer needs to depend on CompilerCaching, fork its compile path on the
Julia version, or implement the `cache_get`/`cache_put!` protocol for runtime
library functions (GPUCompiler caches those itself now). This restores
Julia 1.10 support: the same `compile_or_lookup` runs against the integrated
code cache on 1.11+ and against GPUCompiler's session-local store on 1.10.

`MetalResults` loses its `llvm_ir` slot (it only existed to serve the runtime
library protocol) and keeps `air`/`metallib`/`entry` (session-portable) plus the
per-device pipeline cache (session-local).

The precompilation workload now compiles and links an actual kernel again:
on 1.11+ the inference results and metallib bytes are stored in the package
image (attached to the kernel's CodeInstance), so a fresh session can launch
it without invoking the compiler. This was verified to hit the image, link in
~0.2s, and keep ObjectiveC handles out of the image (`mtlfunction` skips the
session-local pipeline cache while generating output). Launching the kernel
during the workload is not possible: committing a command buffer and waiting
for it hangs during precompilation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
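
One way to keep session-local handles out of a package image is to skip the cache write while Julia is generating output. The sketch below assumes the check is done with the common `jl_generating_output` idiom; whether `mtlfunction` detects it this way is an assumption, and `PortableResults` and `pipeline_sketch!` are hypothetical names.

```julia
# True while Julia is writing a package image or system image.
generating_output() = ccall(:jl_generating_output, Cint, ()) == 1

mutable struct PortableResults
    metallib::Vector{UInt8}   # session-portable, safe to serialize
    pipelines::Vector{Any}    # session-local handles, must stay empty in an image
end

# Link (fake) and cache the handle, except while generating output.
function pipeline_sketch!(r::PortableResults; generating::Bool = generating_output())
    if !generating && !isempty(r.pipelines)
        return first(r.pipelines)          # cache hit in a normal session
    end
    handle = Ref(length(r.metallib))       # stands in for a linked pipeline
    generating || push!(r.pipelines, handle)
    return handle
end

r = PortableResults(UInt8[1, 2, 3], Any[])
h0 = pipeline_sketch!(r; generating = true)    # precompile: nothing is cached
cached_during_precompile = length(r.pipelines)
h1 = pipeline_sketch!(r; generating = false)   # normal session: cached
h2 = pipeline_sketch!(r; generating = false)   # reuses h1
```

The `generating` keyword exists only so the sketch can exercise both paths in one session.
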
maleadt force-pushed the tb/compilercaching branch from 00d69c1 to 90d2673 on July 3, 2026 06:13