Summary
The critical Qwen CUDA regressions are fixed on fix/qwen-image-cuda-offload-2512, but there is still meaningful performance work left, especially around the local cold path and more intelligent mode selection.
Landed fixes
- Split Qwen execution behavior by path instead of forcing one compromise across both local sequential and hot server modes.
- Hot server path now keeps the quantized Qwen transformer resident across VAE decode when possible.
- Hot server prompt cache hits no longer reload the Qwen2.5 encoder before denoising.
- Quantized CUDA path disables CFG batching when native-resolution headroom is too tight.
- Server thumbnail warmup is now disabled by default; thumbnails are generated on demand unless MOLD_THUMBNAIL_WARMUP=1 is set.
- Added regression tests for Qwen planner behavior and server thumbnail warmup env gating.
Benchmark snapshot
Test case: qwen-image-2512:q4, 1328x1328, --steps 1
- Local CLI cold: about 110.6s
  - Dominated by BF16 Qwen2.5 text encoder load (83.4s)
  - Denoise about 8.9s
  - VAE tiled GPU decode about 2.4s
- Hot server cold after load: about 13.7s request time once the model is resident
  - Transformer load before first request: 42.1s
  - Resident Qwen2.5 q4 GGUF GPU encoder load during cold request: 23.5s
  - Denoise about 8.9s
  - VAE tiled GPU decode about 2.0s
- Hot server warm request with prompt cache hit: about 13.1s
  - No Qwen2.5 encoder reload on cache hit
  - Denoise about 8.7s
  - VAE tiled GPU decode about 2.3s
Remaining work
1. Fix local cold-start cost for Qwen-2512
The local sequential path still spends most of its time loading the BF16 Qwen2.5 text encoder. We need a local-only policy that is faster without reintroducing the VRAM poisoning and CPU/RAM blowups seen during experiments.
Options to evaluate:
- Quantized Qwen2.5 encoder on CPU with mmap-friendly reuse and better host-memory behavior.
- Transformer-first local planning with a smaller quantized encoder when it actually reduces total wall clock.
- Resolution-aware and device-aware encoder selection instead of a static threshold.
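To make the third option concrete, here is a minimal sketch of what resolution- and device-aware encoder selection could look like. The `EncoderPlan` type, variant names, and every VRAM threshold below are illustrative assumptions, not the current planner's API or tuned values:

```python
# Hypothetical sketch of resolution- and device-aware encoder selection.
# All thresholds here are placeholders, not measured cutoffs.
from dataclasses import dataclass


@dataclass
class EncoderPlan:
    variant: str  # "bf16" or "q4" (illustrative names)
    device: str   # "cuda" or "cpu"


def select_encoder(width: int, height: int, free_vram_gb: float,
                   hot_server: bool) -> EncoderPlan:
    """Pick an encoder per path instead of using one static threshold."""
    native = width * height >= 1328 * 1328  # native Qwen-2512 or larger
    if hot_server:
        # Hot path: a quantized GPU encoder amortizes across requests.
        return EncoderPlan("q4", "cuda" if free_vram_gb >= 4.0 else "cpu")
    # Local sequential path: skip the slow BF16 load when a smaller
    # quantized encoder plausibly reduces total wall clock.
    if native and free_vram_gb < 8.0:
        return EncoderPlan("q4", "cpu")
    return EncoderPlan("bf16", "cuda" if free_vram_gb >= 16.0 else "cpu")
```

The key design point is that the hot-server and local branches are allowed to disagree, which is exactly the per-path split the landed fixes introduced.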
2. Add a real benchmark harness
We need repeatable benchmarks instead of manual shell runs.
Add a benchmark command or script that records:
- cold local time
- cold hot-server load time
- warm hot-server request time
- transformer reload count
- encoder reload count
- VAE mode selected
- output image filename in repo root
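A harness recording those fields could start as small as the sketch below. The phase names, counter keys, and the repo-root filename are assumptions for illustration, not an agreed convention:

```python
# Minimal benchmark-harness sketch; names and the output filename
# convention are placeholders, not the project's real CLI.
import json
import time
from pathlib import Path


def run_benchmark(label: str, fn) -> dict:
    """Time one phase and return a record for the summary."""
    start = time.perf_counter()
    result = fn()
    return {"phase": label,
            "seconds": round(time.perf_counter() - start, 2),
            "result": result}


def write_summary(records: list[dict], counters: dict,
                  repo_root: str = ".") -> Path:
    """Write timings plus reload counters to a stable repo-root filename."""
    out = Path(repo_root) / "bench-qwen-image-2512.json"
    out.write_text(json.dumps({"runs": records, "counters": counters},
                              indent=2))
    return out


if __name__ == "__main__":
    records = [run_benchmark("cold_local", lambda: "ok")]
    counters = {"transformer_reloads": 0, "encoder_reloads": 0,
                "vae_mode": "tiled_gpu"}
    print(write_summary(records, counters))
```

Emitting one JSON file per run makes regressions diffable between branches, which manual shell runs cannot give us.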
3. Make VAE decode selection proactive instead of retry-based
Right now native Qwen-2512 usually hits GPU VAE OOM and then falls back to tiled GPU decode. That works, but it wastes time.
Investigate:
- selecting tiled GPU decode up front for known native-resolution Qwen CUDA cases
- tile-size heuristics based on free VRAM
- whether warm hot-server mode should always prefer tiled decode for this model
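An up-front selection could look like the sketch below. The per-pixel memory constant is a placeholder, not a measured value; the real heuristic would be calibrated against actual decode peaks:

```python
# Sketch of proactive VAE decode-mode selection based on free VRAM,
# replacing the current OOM-then-retry fallback. Constants are illustrative.
def estimate_full_decode_gb(width: int, height: int) -> float:
    """Rough upper bound for a full (untiled) GPU VAE decode, in GiB."""
    # Assumption: decode activations scale with pixel count; the 350
    # bytes-per-pixel constant is a placeholder, not measured.
    return width * height * 350 / (1024 ** 3)


def choose_vae_mode(width: int, height: int, free_vram_gb: float,
                    safety_margin_gb: float = 1.0) -> str:
    """Return "gpu" or "tiled_gpu" without relying on an OOM retry."""
    needed = estimate_full_decode_gb(width, height) + safety_margin_gb
    return "gpu" if free_vram_gb >= needed else "tiled_gpu"
```

Even a conservative estimate is a win here: skipping one doomed full-decode attempt avoids both the wasted kernel launches and the allocator churn the retry path causes today.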
4. Tighten hot-path residency policy
The current hot path is much better, but we should continue improving residency decisions.
Potential work:
- keep the Qwen2.5 encoder resident on GPU for very short repeat intervals when it is cheaper than reload
- optionally keep it CPU-resident when prompt cache miss rate is high
- make the resident text-encoder choice depend on actual free VRAM after transformer load, not only a coarse threshold
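The three bullets above reduce to one decision function. This sketch uses the 23.5s encoder reload figure from the benchmark snapshot as an example input; the function name, the 0.5 miss-rate cutoff, and the three-way outcome are illustrative, not a committed policy:

```python
# Sketch of a residency decision comparing reload cost against keeping
# the encoder resident. Thresholds are placeholders.
def keep_encoder_resident(expected_gap_s: float, reload_cost_s: float,
                          free_vram_after_transformer_gb: float,
                          encoder_size_gb: float,
                          cache_miss_rate: float) -> str:
    """Return where the encoder should live between requests."""
    if (free_vram_after_transformer_gb >= encoder_size_gb
            and expected_gap_s < reload_cost_s):
        return "gpu"    # cheaper to stay resident than to reload
    if cache_miss_rate > 0.5:
        return "cpu"    # frequent misses: avoid repeated disk loads
    return "evict"      # rare use: free the memory entirely
```

Crucially, the first branch checks free VRAM measured after the transformer load, which addresses the "not only a coarse threshold" point directly.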
5. Add more regression coverage
Add tests for:
- hot server cache hit path skipping Qwen2.5 encoder reload
- hot path preserving transformer residency across tiled VAE decode
- local and hot paths selecting different encoder policies intentionally
- native-resolution Qwen-2512 not producing black images on CUDA
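The first of these tests could be written against a small stub before the real planner exposes counters. Everything here (`PlannerStub`, the reload counter) is invented for illustration; the real test would drive the actual hot-server planner:

```python
# Hypothetical pytest-style sketch for the cache-hit regression.
# PlannerStub stands in for the real planner and its reload counter.
class PlannerStub:
    """Tracks encoder reloads so the test can assert on them."""
    def __init__(self):
        self.encoder_reloads = 0
        self._prompt_cache = {}

    def encode(self, prompt: str):
        if prompt not in self._prompt_cache:
            self.encoder_reloads += 1  # real code would load weights here
            self._prompt_cache[prompt] = hash(prompt)
        return self._prompt_cache[prompt]


def test_cache_hit_skips_encoder_reload():
    planner = PlannerStub()
    planner.encode("a red fox")        # cold request: one reload expected
    before = planner.encoder_reloads
    planner.encode("a red fox")        # warm request: prompt cache hit
    assert planner.encoder_reloads == before
```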
6. Consider black-image focused diagnostics and regression tests
The original black-image symptom on CUDA for qwen-image-2512 is improved by the math/path fixes, but this still deserves stronger regression coverage.
Add:
- latent-range / NaN guard assertions during denoise
- decoded output sanity checks for black or near-black frames
- one reproducible seeded integration test or smoke harness for CUDA-capable environments
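The first two checks are cheap enough to sketch now. The thresholds below (brightness cutoff of 8, 0.1% bright-pixel fraction) are illustrative and would need tuning against real decodes:

```python
# Sketch of latent and decoded-output sanity checks for the black-image
# symptom. Thresholds are placeholders, not validated cutoffs.
import math


def latents_finite(latents: list[float]) -> bool:
    """NaN/Inf guard to run between denoise steps."""
    return all(math.isfinite(v) for v in latents)


def frame_is_black(pixels: list[int], threshold: int = 8,
                   max_bright_fraction: float = 0.001) -> bool:
    """Flag frames that are black or near-black (8-bit pixel values)."""
    bright = sum(1 for p in pixels if p > threshold)
    return bright / max(len(pixels), 1) <= max_bright_fraction
```

Wired into the seeded integration test, these two predicates turn the original silent black-image failure into an immediate, attributable assertion.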
Nice-to-have follow-ups
- Silence or rate-limit noisy server startup warnings around corrupt legacy gallery images.
- Emit one compact structured summary of selected Qwen execution mode in logs.
- Store benchmark outputs and timing summaries in a stable repo-root naming convention.