Startup PSO/shader compilation takes ~2s on warm cache #175

@koubaa

Problem

GoldyRenderer::new (ekrano) takes ~2–3 seconds on every startup even when both the Slang bytecode disk cache and the DX12 PSO / Vulkan pipeline cache are fully warm. This blocks the render thread from starting and delays first frame.

Tracy profiling with the zones ekrano.GoldyRenderer::new.compile_shaders, ekrano.add_shader.*, goldy.dx12.CreateComputePipelineState and goldy.ensure_stage_compiled.slang_cache reveals a clean split by backend:

  • Vulkan: 100% of the time is in goldy.compute_pipeline.create_pso → vkCreateComputePipelines. The Slang cache hit is fast; the bottleneck is the driver consuming the SPIR-V and building the native ISA pipeline, one pipeline at a time, even with a warm VkPipelineCache.
  • DX12: 100% of the time is in goldy.ensure_stage_compiled.slang_cache → compile_with_reflection. Even on a cache hit this function re-runs effective_slang_source_for_compile (the full virtual-main transform over the raw Slang source string) and then compile_cache_key (an FNV hash of the entire transformed source) for every single pipeline creation. With ~24 shaders that is 24 full source transforms + 24 full-source hashes before a single byte of DXIL is read. CreateComputePipelineState with a warm PSO blob is fast by comparison.

There are also two separate contributing bugs that were fixed upstream:

  1. Cache version included the git short hash (GOLDY_CACHE_VERSION=v{pkg}-g{git}-slang{sl}). The build.rs rerun triggers (.git/HEAD, .git/refs/heads, .git/logs/HEAD) fired on every commit, so the cache version string changed with each commit and invalidated the entire shader_cache.bin.zst, forcing a cold start. Fixed by dropping the git hash; per-entry keys are already content-addressed (post-transform source + defines + target + opt-level).

  2. _label parameter in add_compute_shader[_with_options] was unused (prefixed _), so Tracy zones had no per-shader text annotation, making it impossible to identify which shader was slow.
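A minimal sketch of the version string after fix 1, assuming the format simply loses its -g{git} component. The helper names and the idea of emitting the value through a cargo:rustc-env directive are illustrative assumptions, not the actual build.rs:

```rust
/// Hypothetical build.rs helper: the cache version depends only on inputs
/// that change the on-disk cache format, not on the current git commit.
fn cache_version(pkg_version: &str, slang_version: &str) -> String {
    format!("v{}-slang{}", pkg_version, slang_version)
}

/// The line build.rs would print so the crate can read the value with
/// env!("GOLDY_CACHE_VERSION"). `cargo:rustc-env=` is a standard Cargo
/// build-script directive; the variable name comes from the issue text.
fn cargo_directive(pkg_version: &str, slang_version: &str) -> String {
    format!(
        "cargo:rustc-env=GOLDY_CACHE_VERSION={}",
        cache_version(pkg_version, slang_version)
    )
}
```

With this shape, two commits that leave the package and Slang versions untouched produce the same version string, so shader_cache.bin.zst survives them; individual entries are still invalidated by their content-addressed keys.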

Root causes to fix

DX12 — redundant work on every cache hit

compile_with_reflection always calls effective_slang_source_for_compile(source) and recomputes compile_cache_key(...), whether or not bytecode for that source has already been produced in this session. The DX12 path calls ensure_stage_compiled → compile_with_reflection on every ComputePipeline::new. In principle, once compute_bytecode is set on ShaderState, the bytecode is returned immediately from the in-session cache (line ~71 of dx12/shader.rs). In practice that check never hits: ShaderModule::from_slang_with_options creates a new ShaderHandle (storing the source) and ensure_stage_compiled is called fresh on it, so the in-session cache is always cold. The Slang disk cache is always hit, but effective_slang_source_for_compile + compile_cache_key still run on every call.

Potential fixes:

  • Cache the transformed source on ShaderState at creation time so the transform doesn't repeat.
  • Memoize compile_cache_key per (Arc<str>, target, …) (noted as a micro-optimization in a comment on line ~656 of slang/compiler.rs).
  • For the more aggressive fix: deduplicate ShaderHandles by source hash so that the same Slang source string used with the same parameters always returns the same handle, making the in-session bytecode cache actually useful.
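The three options above could be combined roughly as follows. This is a sketch under stated assumptions, not the goldy implementation: ShaderRegistry, Prepared, handle_for and prepared are hypothetical names, the real ShaderState has more fields, the FNV variant is assumed to be 64-bit FNV-1a, and the registry keys on the full source string rather than its hash to sidestep collisions.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// 64-bit FNV-1a (assumed variant; the issue only says "an FNV hash").
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Transformed source and its cache key, computed at most once per state.
struct Prepared {
    transformed: String,
    cache_key: u64,
}

/// Simplified stand-in for the per-shader state described in the issue.
struct ShaderState {
    raw_source: Arc<str>,
    prepared: OnceLock<Prepared>,
}

impl ShaderState {
    /// Runs `transform` (a stand-in for effective_slang_source_for_compile)
    /// and hashes the result only on the first call; later calls reuse both.
    fn prepared(&self, transform: impl FnOnce(&str) -> String) -> &Prepared {
        self.prepared.get_or_init(|| {
            let transformed = transform(&*self.raw_source);
            let cache_key = fnv1a64(transformed.as_bytes());
            Prepared { transformed, cache_key }
        })
    }
}

/// Hypothetical registry: the same source always yields the same state,
/// so anything cached on the state survives across pipeline creations.
#[derive(Default)]
struct ShaderRegistry {
    by_source: Mutex<HashMap<Arc<str>, Arc<ShaderState>>>,
}

impl ShaderRegistry {
    fn handle_for(&self, source: &str) -> Arc<ShaderState> {
        let mut map = self.by_source.lock().unwrap();
        if let Some(existing) = map.get(source) {
            return Arc::clone(existing);
        }
        let raw: Arc<str> = Arc::from(source);
        let state = Arc::new(ShaderState {
            raw_source: Arc::clone(&raw),
            prepared: OnceLock::new(),
        });
        map.insert(raw, Arc::clone(&state));
        state
    }
}
```

A real key would also need to include the defines, target and opt-level mentioned earlier; they are left out here to keep the sketch short.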

Vulkan — sequential single-pipeline vkCreateComputePipelines calls

All 24 compute pipelines are created one at a time via create_compute_pipelines(..., &[pipeline_info], ...). Even with a warm VkPipelineCache the driver processes each call serially and the combined cost is significant.

Potential fixes:

  • Batch all compute pipelines into a single vkCreateComputePipelines call — drivers can parallelize internally and the pipeline cache lock is taken once.
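  • Alternatively, create pipelines on parallel threads (each thread takes its own backend lock slice). vkCreateComputePipelines may be called concurrently from multiple threads, and a shared VkPipelineCache is internally synchronized unless it was created with VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT.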
  • Longer term: investigate VK_EXT_graphics_pipeline_library or VK_KHR_pipeline_library for shader-module reuse across pipelines.
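The parallel-threads option could look roughly like the sketch below. It uses only the standard library: create_in_parallel is a hypothetical helper, and the create closure stands in for whatever wraps vkCreateComputePipelines for a single pipeline, so no Vulkan call is actually shown.

```rust
use std::thread;

/// Runs `create` over `infos` on up to `workers` scoped threads and returns
/// the results in the same order as the inputs.
fn create_in_parallel<I, P, F>(infos: &[I], workers: usize, create: F) -> Vec<P>
where
    I: Sync,
    P: Send,
    F: Fn(&I) -> P + Sync,
{
    if infos.is_empty() {
        return Vec::new();
    }
    // Never spawn more threads than there are pipelines, and at least one.
    let workers = workers.clamp(1, infos.len());
    // Ceiling division so every input lands in exactly one chunk.
    let chunk_len = (infos.len() + workers - 1) / workers;
    let create = &create;

    thread::scope(|scope| {
        // One thread per contiguous chunk of create-infos.
        let handles: Vec<_> = infos
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().map(create).collect::<Vec<P>>()))
            .collect();

        // Joining in spawn order keeps the output aligned with the input.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("pipeline worker panicked"))
            .collect()
    })
}
```

Whether this beats a single batched call depends on the driver, so both variants would need to be measured under Tracy before picking one.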

Observed numbers

Run                        Time
Cold (no caches)           ~3.1 s
Warm (after fix 1 above)   ~2.3 s
Target                     < 100 ms
