Problem
GoldyRenderer::new (ekrano) takes ~2–3 seconds on every startup, even when both the Slang bytecode disk cache and the DX12 PSO / Vulkan pipeline cache are fully warm. This blocks the render thread from starting and delays the first frame.
Tracy profiling with ekrano.GoldyRenderer::new.compile_shaders / ekrano.add_shader.* / goldy.dx12.CreateComputePipelineState / goldy.ensure_stage_compiled.slang_cache reveals a clean split by backend:
- Vulkan: 100% of the time is in goldy.compute_pipeline.create_pso → vkCreateComputePipelines. The Slang cache hit is fast; the bottleneck is the driver consuming the SPIR-V and building the native ISA pipeline, one pipeline at a time, even with a warm VkPipelineCache.
- DX12: 100% of the time is in goldy.ensure_stage_compiled.slang_cache → compile_with_reflection. Even on a cache hit this function re-runs effective_slang_source_for_compile (the full virtual-main transform over the raw Slang source string) and then compile_cache_key (an FNV hash of the entire transformed source) for every single pipeline creation. With ~24 shaders that is 24 full source transforms + 24 full-source hashes before a single byte of DXIL is read. CreateComputePipelineState with a warm PSO blob is fast by comparison.
There are also two separate contributing bugs that were fixed upstream:
- The cache version included the git short hash (GOLDY_CACHE_VERSION=v{pkg}-g{git}-slang{sl}). The build.rs rerun triggers (.git/HEAD, .git/refs/heads, .git/logs/HEAD) fired on every commit, so the cache version string changed on every git commit. That invalidated the entire shader_cache.bin.zst and forced a cold start. Fixed by dropping the git hash; per-entry keys are already content-addressed (post-transform source + defines + target + opt-level).
- The _label parameter in add_compute_shader[_with_options] was unused (prefixed with _), so Tracy zones had no per-shader text annotation, making it impossible to identify which shader was slow.
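To illustrate why the git hash is redundant, here is a minimal sketch of a content-addressed key over the four inputs listed above. It assumes 64-bit FNV-1a; the issue only says "an FNV hash", so the variant, the cache_key name, the signature, and the 0xFF field separator are all assumptions rather than goldy's actual compile_cache_key.

```rust
/// Standard 64-bit FNV-1a offset basis and prime.
const FNV_OFFSET: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Fold `bytes` into a running FNV-1a state.
fn fnv1a(mut state: u64, bytes: &[u8]) -> u64 {
    for &b in bytes {
        state ^= u64::from(b);
        state = state.wrapping_mul(FNV_PRIME);
    }
    state
}

/// Hypothetical key builder: hashes every input that affects the compiled output.
/// A 0xFF byte (never valid in UTF-8) separates fields so they cannot run together.
fn cache_key(
    transformed_source: &str,
    defines: &[(&str, &str)],
    target: &str,
    opt_level: u8,
) -> u64 {
    let mut h = fnv1a(FNV_OFFSET, transformed_source.as_bytes());
    h = fnv1a(h, &[0xFF]);
    for (name, value) in defines {
        h = fnv1a(h, name.as_bytes());
        h = fnv1a(h, &[0xFF]);
        h = fnv1a(h, value.as_bytes());
        h = fnv1a(h, &[0xFF]);
    }
    h = fnv1a(h, target.as_bytes());
    h = fnv1a(h, &[0xFF]);
    fnv1a(h, &[opt_level])
}
```

Because any change to the source, defines, target, or optimization level produces a different key, a stale entry is simply never looked up, and the global version string only needs to change when the cache format or the Slang version changes.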
Root causes to fix
DX12 — redundant work on every cache hit
The DX12 path calls ensure_stage_compiled → compile_with_reflection on every ComputePipeline::new. ensure_stage_compiled does have an in-session cache: once compute_bytecode is set on ShaderState, the bytecode is returned immediately (line ~71 of dx12/shader.rs). In practice that check never hits, because ShaderModule::from_slang_with_options creates a new ShaderHandle (storing the source) and ensure_stage_compiled is called fresh on it, so the in-session cache is always cold. Every call therefore falls through to compile_with_reflection, which always calls effective_slang_source_for_compile(source) and recomputes compile_cache_key(...) before it consults the Slang disk cache. The disk cache is always hit, but the transform and the hash run on every call regardless.
Potential fixes:
- Cache the transformed source on ShaderState at creation time so the transform doesn't repeat.
- Memoize compile_cache_key per (Arc<str>, target, …) (noted as a micro-optimization in a comment on line ~656 of slang/compiler.rs).
- For the more aggressive fix: deduplicate ShaderHandles by source hash so that the same Slang source string used with the same parameters always returns the same handle, making the in-session bytecode cache actually useful.
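The first and third options could be combined roughly as sketched below. This is not goldy's code: ShaderRegistry, get_or_create, transform_source, and the fields shown are hypothetical stand-ins, and the map is keyed on the full source string plus target rather than on a precomputed source hash, which keeps the sketch collision-free at the cost of hashing the source on each lookup.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Counts transform runs so the effect of caching is observable.
static TRANSFORM_RUNS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for the real virtual-main transform (hypothetical body).
fn transform_source(source: &str) -> String {
    TRANSFORM_RUNS.fetch_add(1, Ordering::Relaxed);
    format!("// transformed\n{source}")
}

/// Hypothetical per-shader state: the transformed source is computed once, on first use.
struct ShaderState {
    source: Arc<str>,
    transformed: OnceLock<String>,
}

impl ShaderState {
    fn transformed_source(&self) -> &str {
        self.transformed
            .get_or_init(|| transform_source(&self.source))
    }
}

/// Hypothetical registry that hands out one shared state per (source, target) pair.
#[derive(Default)]
struct ShaderRegistry {
    states: Mutex<HashMap<(Arc<str>, String), Arc<ShaderState>>>,
}

impl ShaderRegistry {
    fn get_or_create(&self, source: &str, target: &str) -> Arc<ShaderState> {
        let mut states = self.states.lock().unwrap();
        let key = (Arc::<str>::from(source), target.to_string());
        // Reuse the existing state if this (source, target) pair was seen before.
        let state = states.entry(key).or_insert_with_key(|(src, _)| {
            Arc::new(ShaderState {
                source: Arc::clone(src),
                transformed: OnceLock::new(),
            })
        });
        Arc::clone(state)
    }
}
```

With a shared state per shader, any bytecode stored on it after the first compile would be visible to every later pipeline that uses the same source and parameters, which is what the in-session cache needs in order to hit.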
Vulkan — sequential single-pipeline vkCreateComputePipelines calls
All 24 compute pipelines are created one at a time via create_compute_pipelines(..., &[pipeline_info], ...). Even with a warm VkPipelineCache the driver processes each call serially and the combined cost is significant.
Potential fixes:
- Batch all compute pipelines into a single vkCreateComputePipelines call: drivers can parallelize internally and the pipeline cache lock is taken once.
- Alternatively, create pipelines on parallel threads (each thread takes its own backend lock slice). vkCreateComputePipelines may be called concurrently from multiple threads, including against a shared VkPipelineCache, which is internally synchronized unless it was created with VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT.
- Longer term: investigate VK_KHR_pipeline_library (or VK_EXT_graphics_pipeline_library, although that extension covers graphics pipelines only) for shader-module reuse across pipelines.
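The parallel-threads option might be structured as in the sketch below, which uses only the standard library. PipelineDesc, Pipeline, and create_pipeline are hypothetical placeholders; in the real backend create_pipeline would wrap the vkCreateComputePipelines call, and whether the calls actually overlap depends on the driver and on how the backend lock is split.

```rust
use std::thread;

/// Hypothetical description of one compute pipeline to build.
struct PipelineDesc {
    label: &'static str,
}

/// Hypothetical finished pipeline.
#[derive(Debug, PartialEq)]
struct Pipeline {
    label: &'static str,
}

/// Stand-in for the per-pipeline driver call.
fn create_pipeline(desc: &PipelineDesc) -> Result<Pipeline, String> {
    Ok(Pipeline { label: desc.label })
}

/// Split `descs` into at most `workers` chunks and build each chunk on its own
/// scoped thread. Results come back in the same order as `descs`.
fn create_pipelines_parallel(
    descs: &[PipelineDesc],
    workers: usize,
) -> Result<Vec<Pipeline>, String> {
    if descs.is_empty() {
        return Ok(Vec::new());
    }
    let workers = workers.max(1);
    let chunk_len = (descs.len() + workers - 1) / workers; // ceiling division
    thread::scope(|scope| -> Result<Vec<Pipeline>, String> {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = descs
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(create_pipeline)
                        .collect::<Result<Vec<_>, String>>()
                })
            })
            .collect();
        // Join in spawn order to preserve the input ordering.
        let mut pipelines = Vec::with_capacity(descs.len());
        for handle in handles {
            pipelines.extend(handle.join().expect("pipeline worker panicked")?);
        }
        Ok(pipelines)
    })
}
```

Scoped threads let the workers borrow the descriptor slice directly, so nothing has to be cloned or wrapped in Arc just to cross the thread boundary.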
Observed numbers
| Run | Time |
| --- | --- |
| Cold (no caches) | ~3.1 s |
| Warm (after fix 1 above) | ~2.3 s |
| Target | < 100 ms |