Switch to CompilerCaching.jl #777
Draft
maleadt wants to merge 15 commits into
Replace `cached_compilation` with a `MetalResults` struct attached to each `CodeInstance` via `CompilerCaching`: `metallib` + entry name are session-portable (cached through precompilation), and the `MTLComputePipelineState` is materialized lazily per session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a `bitcode` field to `MetalResults` and overrides `GPUCompiler.bitcode` / `bitcode!`. Per-function runtime library bitcode now rides on the same precompilation path as `metallib`/`entry`, so cross-session loads can skip the runtime rebuild.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the single `pipeline` slot on `MetalResults` with a small linear cache of `(MTLDevice, MTLComputePipelineState)` pairs. The cache partition already covers the macOS / AIR / Metal versions that affect codegen, but two `MTLDevice`s on a single Mac (e.g. integrated + discrete) share the same `metallib` and need separate `MTLComputePipelineState`s.

Hot-path cost is unchanged: one field load + one `===` compare. The common case (single device) stays at n=1.

`link_pipeline` now takes the target `MTLDevice` explicitly instead of calling `device()` internally, so the call site captures the device once under `mtlfunction_lock`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
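A minimal sketch of the per-device linear cache described above. `FakeDevice`, `FakePipeline`, `SketchResults`, and `pipeline_for!` are hypothetical stand-ins, not Metal.jl types or functions, and the sketch omits the locking that the commit does under `mtlfunction_lock`.

```julia
# Hypothetical stand-ins for MTLDevice / MTLComputePipelineState.
mutable struct FakeDevice          # mutable, so `===` compares object identity
    name::String
end

struct FakePipeline
    device::FakeDevice
    entry::String
end

# Stand-in for the results object: a portable entry name plus a
# session-local list of (device, pipeline) pairs.
struct SketchResults
    entry::String
    pipelines::Vector{Tuple{FakeDevice,FakePipeline}}
end
SketchResults(entry::String) = SketchResults(entry, Tuple{FakeDevice,FakePipeline}[])

const link_count = Ref(0)          # counts how often we "link" a pipeline

# Assumed shape: the target device is passed in explicitly.
function link_pipeline(r::SketchResults, dev::FakeDevice)
    link_count[] += 1
    return FakePipeline(dev, r.entry)
end

# Linear scan; with a single device this is one load and one `===` compare.
function pipeline_for!(r::SketchResults, dev::FakeDevice)
    for (d, p) in r.pipelines
        d === dev && return p
    end
    p = link_pipeline(r, dev)
    push!(r.pipelines, (dev, p))
    return p
end

r = SketchResults("kernel")
integrated, discrete = FakeDevice("integrated"), FakeDevice("discrete")
p1 = pipeline_for!(r, integrated)   # links
p2 = pipeline_for!(r, integrated)   # cache hit, no link
p3 = pipeline_for!(r, discrete)     # second device gets its own pipeline
```

A linear scan is a reasonable fit here because the list is expected to hold one entry in the common case and only a handful otherwise.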
`something(lookup(...), compile_metal!(...))` evaluated `compile_metal!` eagerly even on a cache hit, so every kernel launch silently re-ran the full LLVM compile pipeline. Branch explicitly on the lookup result.

Warm-cache `mtlfunction` cost: ~3.4 ms → ~380 ns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
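The pitfall is easy to reproduce outside Metal.jl. In the sketch below, `lookup`, `compile!`, `sketch_cache`, and `compile_calls` are hypothetical stand-ins. `something` is an ordinary function, so Julia evaluates both arguments before the call; an explicit branch (or the short-circuiting `@something` macro from Base, Julia 1.7+) only runs the fallback on a miss.

```julia
const sketch_cache = Dict{Symbol,String}()
const compile_calls = Ref(0)        # counts how often the "compiler" runs

lookup(key) = get(sketch_cache, key, nothing)

function compile!(key)
    compile_calls[] += 1
    sketch_cache[key] = "compiled-$key"   # assignment evaluates to the stored value
end

# Eager: compile!(key) runs on every call, even when lookup(key) hits.
eager(key) = something(lookup(key), compile!(key))

# Explicit branch: compile only when the lookup misses.
function branched(key)
    hit = lookup(key)
    hit !== nothing && return hit
    return compile!(key)
end

# Equivalent short-circuiting form using the macro.
lazy(key) = @something(lookup(key), compile!(key))

function count_compiles(f)
    empty!(sketch_cache)
    compile_calls[] = 0
    f(:k); f(:k)                    # second call is a warm-cache call
    return compile_calls[]
end

eager_n    = count_compiles(eager)      # 2: recompiles on the warm call
branched_n = count_compiles(branched)   # 1
lazy_n     = count_compiles(lazy)       # 1
```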
Adds `bitcode_opaque` / `bitcode_typed` fields to `MetalResults` and overrides the new `GPUCompiler.bitcode` / `bitcode!` trait pair. Runtime functions (`gpu_malloc`, `gpu_report_exception`, …) now persist their post-irgen renamed LLVM bitcode on their own `CodeInstance` on 1.11+, which carries through package precompilation. 1.10 keeps falling back to the session-local `_runtime_libs` cache.

The two slots reflect that opaque-pointer and typed-pointer LLVM IR aren't interchangeable. In practice modern LLVM always uses opaque pointers; the typed slot exists for older Julia/LLVM combinations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Matches GPUCompiler's simplified `bitcode`/`bitcode!` trait pair: LLVM's pointer mode is fixed across a precompile/load pair of the same Julia version, so one slot per `CodeInstance` is enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cache-hit path in `compile_or_lookup` just needs to distinguish a CI that completed compilation from one that was only inferred as a callee. A direct `r.metallib !== nothing` check at the single call site is clearer than routing through a one-line trait override.

Also switch to `GPUCompiler.cache_view(job)` instead of constructing the `CacheView` manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
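As an illustration of the check, here is a sketch in which a results object can exist without a `metallib` (the function was only inferred as a callee) and counts as a hit only once `metallib` is set. `SketchEntry`, `sketch_store`, `compile_sketch!`, and `compile_or_lookup_sketch` are hypothetical names; the real lookup goes through GPUCompiler's cache view rather than a `Dict`.

```julia
# Hypothetical stand-in for the results attached to a CodeInstance.
mutable struct SketchEntry
    metallib::Union{Nothing,Vector{UInt8}}
end

const sketch_store = Dict{Symbol,SketchEntry}()
const compiles = Ref(0)

function compile_sketch!(key)
    compiles[] += 1
    r = get!(() -> SketchEntry(nothing), sketch_store, key)
    r.metallib = Vector{UInt8}("lib:$key")   # placeholder bytes, not a real metallib
    return r
end

function compile_or_lookup_sketch(key)
    r = get(sketch_store, key, nothing)
    # Hit only if an entry exists AND it finished compilation.
    if r !== nothing && r.metallib !== nothing
        return r
    end
    return compile_sketch!(key)
end

# An entry that was only inferred as a callee: present, but no metallib yet.
sketch_store[:callee] = SketchEntry(nothing)
first_result  = compile_or_lookup_sketch(:callee)   # compiles
second_result = compile_or_lookup_sketch(:callee)   # hit
```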
…ut!.

`MetalResults` now keeps every kernel pipeline stage (LLVM IR bytes, AIR, the final metallib) instead of just the last one. The intermediates were already being computed in `compile_to_metallib` and then discarded; storing them is nearly free and lets reflection dump any phase after the fact without re-running the compile.

The single byte slot is named `llvm_ir` (not `bitcode`) because AIR is also a form of bitcode; the field always holds the LLVM-stage output. For runtime-library function CIs it's the final artifact; for kernel CIs it's an intermediate before the AIR downgrade. The content is binary bitcode (`write(io, mod)`), ~10× smaller than textual IR. Since each `MetalResults` is attached to a single CI (either a kernel MI or a runtime-function MI), the two roles never share a slot, so one field covers both cleanly.

`cache_get` / `cache_put!` overrides on `MetalCompilerJob` implement GPUCompiler's new caching protocol, routing the `:llvm_ir` key through a CompilerCaching lookup to the relevant CI's `MetalResults.llvm_ir`.

`compile_to_metallib` now returns a NamedTuple of all artifacts; the caller (`compile_metal!` on 1.11+, `_legacy_link` on 1.10) copies them onto the appropriate `MetalResults`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
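The "return every stage, let the caller store it" shape can be sketched as follows. `StageResults`, `compile_stages`, and `store!` are hypothetical names, and the byte vectors are placeholders rather than real bitcode, AIR, or metallib contents.

```julia
# Hypothetical results object keeping every stage, not just the last one.
mutable struct StageResults
    llvm_ir::Union{Nothing,Vector{UInt8}}
    air::Union{Nothing,Vector{UInt8}}
    metallib::Union{Nothing,Vector{UInt8}}
    entry::Union{Nothing,String}
end
StageResults() = StageResults(nothing, nothing, nothing, nothing)

# Stand-in for the compile step: returns a NamedTuple of all artifacts
# instead of discarding the intermediates.
function compile_stages(name::String)
    llvm_ir  = Vector{UInt8}("bc:" * name)    # placeholder for LLVM bitcode
    air      = Vector{UInt8}("air:" * name)   # placeholder for AIR
    metallib = Vector{UInt8}("lib:" * name)   # placeholder for the metallib
    return (; llvm_ir, air, metallib, entry = "kernel_" * name)
end

# The caller decides where the artifacts live.
function store!(r::StageResults, artifacts::NamedTuple)
    r.llvm_ir  = artifacts.llvm_ir
    r.air      = artifacts.air
    r.metallib = artifacts.metallib
    r.entry    = artifacts.entry
    return r
end

results = store!(StageResults(), compile_stages("vadd"))
# Any stage can now be inspected without recompiling:
air_dump = String(copy(results.air))
```

Returning a NamedTuple keeps the compile step free of side effects, so callers on different code paths can store the same artifacts in different places.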
GPUCompiler 2.0 now mediates all interaction with the compilation cache through a single version-agnostic entry point, `cached_results`, so Metal no longer needs to depend on CompilerCaching, fork its compile path on the Julia version, or implement the `cache_get`/`cache_put!` protocol for runtime library functions (GPUCompiler caches those itself now).

This restores Julia 1.10 support: the same `compile_or_lookup` runs against the integrated code cache on 1.11+ and against GPUCompiler's session-local store on 1.10.

`MetalResults` loses its `llvm_ir` slot, which only existed to serve the runtime library protocol, and keeps `air`/`metallib`/`entry` (session-portable) plus the per-device pipeline cache (session-local).

The precompilation workload now compiles and links an actual kernel again: on 1.11+ the inference results and metallib bytes are stored in the package image (attached to the kernel's `CodeInstance`), so a fresh session can launch it without invoking the compiler. This was verified to hit the image, link in ~0.2 s, and keep ObjectiveC handles out of the image (`mtlfunction` skips the session-local pipeline cache while generating output). Launching the kernel during the workload is not possible: committing a command buffer and waiting for it hangs during precompilation.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
00d69c1 to
90d2673
x-ref JuliaGPU/GPUCompiler.jl#794