Startup PSO/shader compilation takes ~2s on warm cache #175

## Problem

`GoldyRenderer::new` (ekrano) takes ~2–3 seconds on every startup, even when both the Slang bytecode disk cache and the DX12 PSO / Vulkan pipeline cache are fully warm. This blocks the render thread from starting and delays the first frame.

Tracy profiling with `ekrano.GoldyRenderer::new.compile_shaders` / `ekrano.add_shader.*` / `goldy.dx12.CreateComputePipelineState` / `goldy.ensure_stage_compiled.slang_cache` reveals a clean split by backend:

- **Vulkan**: 100% of the time is in `goldy.compute_pipeline.create_pso` → `vkCreateComputePipelines`. The Slang cache hit is fast; the bottleneck is the driver consuming the SPIR-V and building the native ISA pipeline, one pipeline at a time, even with a warm `VkPipelineCache`.
- **DX12**: 100% of the time is in `goldy.ensure_stage_compiled.slang_cache` → `compile_with_reflection`. Even on a cache hit this function re-runs `effective_slang_source_for_compile` (the full virtual-main transform over the raw Slang source string) and then `compile_cache_key` (an FNV hash of the entire transformed source) for every single pipeline creation. With ~24 shaders that is 24 full source transforms + 24 full-source hashes before a single byte of DXIL is read. `CreateComputePipelineState` with a warm PSO blob is fast by comparison.

There are also two separate contributing bugs that were fixed upstream:

1. **Cache version included git short hash** (`GOLDY_CACHE_VERSION=v{pkg}-g{git}-slang{sl}`). The `build.rs` rerun triggers fired on every commit (`.git/HEAD`, `.git/refs/heads`, `.git/logs/HEAD`), so the cache version string changed on every `git commit`, invalidating the entire `shader_cache.bin.zst` and forcing a cold start. Fixed by dropping the git hash; per-entry keys are already content-addressed (post-transform source + defines + target + opt-level).

2. **`_label` parameter in `add_compute_shader[_with_options]` was unused** (prefixed `_`), so Tracy zones had no per-shader text annotation, making it impossible to identify which shader was slow.
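To illustrate bug 1, here is a minimal sketch of the two version schemes. The function names and the post-fix format string are assumptions for illustration (the issue only says the git hash was dropped), and the version values in the usage note are made up.

```rust
/// Old scheme from the issue: the git short hash is part of the version,
/// so the string changes on every commit.
fn cache_version_with_git(pkg: &str, git: &str, slang: &str) -> String {
    format!("v{pkg}-g{git}-slang{slang}")
}

/// Assumed post-fix scheme: only the package and Slang versions remain.
/// Per-entry keys stay content-addressed, so stale bytecode is still
/// rejected entry by entry.
fn cache_version(pkg: &str, slang: &str) -> String {
    format!("v{pkg}-slang{slang}")
}
```

With the old scheme, `cache_version_with_git("0.3.0", "abc1234", "2025.1")` and the same call with `"def5678"` differ, so the whole cache file is discarded between commits. With the assumed new scheme, the string only changes when the package or Slang version does.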

## Root causes to fix

### DX12 — redundant work on every cache hit

`compile_with_reflection` always calls `effective_slang_source_for_compile(source)` and recomputes `compile_cache_key(...)` before it reads anything from the Slang disk cache. The DX12 path calls `ensure_stage_compiled` → `compile_with_reflection` on every `ComputePipeline::new`. In principle, once `compute_bytecode` is set on `ShaderState`, the bytecode is returned immediately from the in-session cache (line ~71 of `dx12/shader.rs`). In practice that check never hits: `ShaderModule::from_slang_with_options` creates a *new* `ShaderHandle` (storing the source) each time, and `ensure_stage_compiled` is called fresh on it, so the in-session cache is always cold. The Slang disk cache is always hit, but `effective_slang_source_for_compile` + `compile_cache_key` run on every call regardless.

Potential fixes:
- Cache the transformed source on `ShaderState` at creation time so the transform doesn't repeat.
- Memoize `compile_cache_key` per `(Arc<str>, target, …)` (noted as a micro-optimization in a comment on line ~656 of `slang/compiler.rs`).
- For the more aggressive fix: deduplicate `ShaderHandle`s by source hash so that the same Slang source string used with the same parameters always returns the same handle, making the in-session bytecode cache actually useful.
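The handle-deduplication idea could look roughly like the sketch below. Everything here is a hypothetical stand-in rather than Goldy's actual code: `ShaderRegistry`, `CachedShader`, and `transform` (a placeholder for `effective_slang_source_for_compile`) are invented for illustration, and `fnv1a` is the standard 64-bit FNV-1a, which may not be the variant `compile_cache_key` uses.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical per-shader state that is computed once and then shared.
#[derive(Debug)]
struct CachedShader {
    transformed_source: String,
    cache_key: u64,
}

/// Standard 64-bit FNV-1a (assumed variant).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Placeholder for the expensive virtual-main source transform.
fn transform(source: &str) -> String {
    format!("// virtual main\n{source}")
}

/// Hands out one shared entry per (raw source hash, target) pair.
#[derive(Default)]
struct ShaderRegistry {
    entries: Mutex<HashMap<(u64, String), Arc<CachedShader>>>,
    /// Counts how often the expensive transform actually ran.
    transforms_run: AtomicUsize,
}

impl ShaderRegistry {
    fn get_or_create(&self, source: &str, target: &str) -> Arc<CachedShader> {
        // Still one hash over the raw source per lookup, but no transform.
        let lookup = (fnv1a(source.as_bytes()), target.to_owned());
        let mut entries = self.entries.lock().unwrap();
        if let Some(entry) = entries.get(&lookup) {
            return Arc::clone(entry);
        }
        // Miss: run the transform and derive the disk-cache key exactly once.
        self.transforms_run.fetch_add(1, Ordering::SeqCst);
        let transformed_source = transform(source);
        let cache_key = fnv1a(format!("{transformed_source}\0{target}").as_bytes());
        let entry = Arc::new(CachedShader { transformed_source, cache_key });
        entries.insert(lookup, Arc::clone(&entry));
        entry
    }
}
```

Requesting the same source and target twice returns the same `Arc`, so the transform runs once and any bytecode stored on the shared entry would be visible to later pipeline creations. Keying on a 64-bit hash alone can collide in theory; a real implementation would probably also compare the source, or key on an `Arc<str>` identity as the comment in `slang/compiler.rs` suggests.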

### Vulkan — sequential single-pipeline `vkCreateComputePipelines` calls

All 24 compute pipelines are created one at a time via `create_compute_pipelines(..., &[pipeline_info], ...)`. Even with a warm `VkPipelineCache` the driver processes each call serially and the combined cost is significant.

Potential fixes:
- Batch all compute pipelines into a single `vkCreateComputePipelines` call — drivers can parallelize internally and the pipeline cache lock is taken once.
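- Alternatively, create pipelines on parallel threads (each thread takes its own backend lock slice). `vkCreateComputePipelines` may be called concurrently from multiple threads, and a shared `VkPipelineCache` is synchronized internally unless it was created with `VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT`.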
- Longer term: investigate `VK_KHR_pipeline_library` (and `VK_EXT_graphics_pipeline_library`, which builds on it but covers graphics pipelines only) for shader-module reuse across pipelines.
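The parallel-threads option can be sketched with scoped threads from the standard library. `PipelineInfo`, `Pipeline`, and `create_pipeline` are hypothetical stand-ins; in the real backend `create_pipeline` would wrap the `vkCreateComputePipelines` call, and nothing here reflects Goldy's actual locking.

```rust
use std::thread;

/// Hypothetical description of one compute pipeline.
struct PipelineInfo {
    label: String,
}

/// Hypothetical created pipeline.
struct Pipeline {
    label: String,
}

/// Placeholder for the single-pipeline backend call.
fn create_pipeline(info: &PipelineInfo) -> Pipeline {
    Pipeline { label: info.label.clone() }
}

/// Splits `infos` into up to `workers` contiguous chunks and creates each
/// chunk on its own scoped thread. Results come back in input order.
fn create_pipelines_parallel(infos: &[PipelineInfo], workers: usize) -> Vec<Pipeline> {
    let workers = workers.max(1);
    // Ceiling division; at least 1 so `chunks` never receives 0.
    let chunk_len = ((infos.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        let handles: Vec<_> = infos
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(create_pipeline).collect::<Vec<_>>())
            })
            .collect();
        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("pipeline worker panicked"))
            .collect()
    })
}
```

With 24 shaders and 4 workers, each thread would create 6 pipelines. The batching option has the same shape at the call site: collect all 24 descriptions first, then hand the whole slice to one creation call instead of looping over single-element slices.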

## Observed numbers

| Run | Time |
|-----|------|
| Cold (no caches) | ~3.1 s |
| Warm (after fix 1 above) | ~2.3 s |
| Target | < 100 ms |
