
Qwen-2512 remaining CUDA perf work: local cold path, benchmark harness, and proactive VAE planning #212

@jamesbrink

Description


Summary

The critical Qwen CUDA regressions are fixed on fix/qwen-image-cuda-offload-2512, but there is still meaningful performance work left, especially around the local cold path and more intelligent mode selection.

Landed fixes

  • Split Qwen execution behavior by path instead of forcing one compromise across both local sequential and hot server modes.
  • Hot server path now keeps the quantized Qwen transformer resident across VAE decode when possible.
  • Hot server prompt cache hits no longer reload the Qwen2.5 encoder before denoising.
  • Quantized CUDA path disables CFG batching when native-resolution headroom is too tight.
  • Server thumbnail warmup is now disabled by default; thumbnails are generated on demand unless MOLD_THUMBNAIL_WARMUP=1.
  • Added regression tests for Qwen planner behavior and server thumbnail warmup env gating.
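The thumbnail warmup gating described above can be sketched as a small opt-in check. This is an illustrative helper (the function name and `env` injection parameter are assumptions); only the `MOLD_THUMBNAIL_WARMUP=1` contract comes from the landed fix:

```python
import os

def thumbnail_warmup_enabled(env=None):
    """Thumbnail warmup is opt-in: only MOLD_THUMBNAIL_WARMUP=1 enables it.

    Accepts an injected mapping for testability; defaults to os.environ.
    """
    env = os.environ if env is None else env
    return env.get("MOLD_THUMBNAIL_WARMUP") == "1"
```

Injecting the environment mapping keeps the env-gating testable without mutating the real process environment, which is what the warmup regression test needs.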

Benchmark snapshot

Test case: qwen-image-2512:q4, 1328x1328, --steps 1

  • Local CLI cold: about 110.6s
    • Dominated by BF16 Qwen2.5 text encoder load (83.4s)
    • Denoise about 8.9s
    • VAE tiled GPU decode about 2.4s
  • Hot server cold request after model load: about 13.7s once the transformer is resident
    • Transformer load before first request: 42.1s
    • Resident Qwen2.5 q4 GGUF GPU encoder load during cold request: 23.5s
    • Denoise about 8.9s
    • VAE tiled GPU decode about 2.0s
  • Hot server warm request with prompt cache hit: about 13.1s
    • No Qwen2.5 encoder reload on cache hit
    • Denoise about 8.7s
    • VAE tiled GPU decode about 2.3s

Remaining work

1. Fix local cold-start cost for Qwen-2512

The local sequential path still spends most of its time loading the BF16 Qwen2.5 text encoder. We need a local-only policy that is faster without reintroducing the VRAM poisoning and CPU/RAM blowups seen during experiments.

Options to evaluate:

  • Quantized Qwen2.5 encoder on CPU with mmap-friendly reuse and better host-memory behavior.
  • Transformer-first local planning with a smaller quantized encoder when it actually reduces total wall clock.
  • Resolution-aware and device-aware encoder selection instead of a static threshold.
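A resolution-aware and device-aware selection policy could look roughly like the sketch below. The function name, return labels, and thresholds are all hypothetical; the motivating numbers (the ~83s BF16 encoder load vs. the ~23.5s q4 GGUF load) come from the benchmark snapshot above:

```python
# Hypothetical policy sketch: names and thresholds are illustrative, not measured.
def select_encoder(width, height, free_vram_gb):
    """Pick a Qwen2.5 text-encoder variant per request instead of a static threshold.

    Large native renders leave little VRAM headroom, so prefer the quantized
    encoder there; it also avoids the ~83s BF16 load on the local cold path.
    """
    native_megapixels = (width * height) / 1e6
    if native_megapixels >= 1.5 or free_vram_gb < 8.0:
        return "qwen2.5-q4-gguf"
    return "qwen2.5-bf16"
```

The point of the sketch is that the decision takes both the requested resolution and the device's free VRAM as inputs, rather than a single static cutoff.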

2. Add a real benchmark harness

We need repeatable benchmarks instead of manual shell runs.

Add a benchmark command or script that records:

  • local cold time
  • hot-server cold load time
  • hot-server warm request time
  • transformer reload count
  • encoder reload count
  • VAE mode selected
  • output image filename in repo root
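The record the harness emits per run could be a simple dataclass serialized to JSON; a minimal sketch, with all field names assumed (only the set of metrics comes from the list above):

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class BenchRecord:
    """One benchmark run; field names are illustrative."""
    local_cold_s: float
    hot_server_cold_load_s: float
    hot_server_warm_request_s: float
    transformer_reloads: int
    encoder_reloads: int
    vae_mode: str          # e.g. "gpu" or "tiled-gpu"
    output_image: str      # filename written to the repo root

def write_record(rec, path):
    """Persist a run as pretty-printed JSON for later comparison."""
    with open(path, "w") as f:
        json.dump(asdict(rec), f, indent=2)
```

Keeping the record flat and JSON-serializable makes it trivial to diff runs across branches or plot regressions over time.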

3. Make VAE decode selection proactive instead of retry-based

Right now, native-resolution Qwen-2512 usually hits a GPU VAE OOM and then falls back to tiled GPU decode. The fallback works, but every run pays for the doomed full-frame attempt first.

Investigate:

  • selecting tiled GPU decode up front for known native-resolution Qwen CUDA cases
  • tile-size heuristics based on free VRAM
  • whether warm hot-server mode should always prefer tiled decode for this model
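The three investigation points above can be combined into one up-front planner sketch. Everything here is a hypothetical heuristic, including the ~4 GB-per-megapixel decode estimate and the tile sizes; nothing in it is a measured requirement:

```python
# Hypothetical heuristic: the memory estimate and tile sizes are assumed.
def plan_vae_decode(width, height, free_vram_gb):
    """Choose (mode, tile_px) up front instead of catching a GPU OOM and retrying.

    Returns ("gpu", 0) for a full-frame decode, or ("tiled-gpu", tile_px)
    with a tile size scaled to the available VRAM.
    """
    # Rough estimate of decode activation memory for a full-frame GPU decode.
    est_full_decode_gb = (width * height) / 1e6 * 4.0  # ~4 GB per megapixel, assumed
    if est_full_decode_gb <= free_vram_gb * 0.8:       # keep 20% headroom
        return ("gpu", 0)
    # Shrink tiles as free VRAM shrinks.
    tile = 512 if free_vram_gb >= 6.0 else 256
    return ("tiled-gpu", tile)
```

With a planner like this, known native-resolution Qwen CUDA cases go straight to tiled decode, and the warm hot-server question becomes whether the `free_vram_gb` input ever clears the full-frame bar for this model.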

4. Tighten hot-path residency policy

The current hot path is much better, but we should continue improving residency decisions.

Potential work:

  • keep the Qwen2.5 encoder resident on GPU for very short repeat intervals when it is cheaper than reload
  • optionally keep it CPU-resident when prompt cache miss rate is high
  • make the resident text-encoder choice depend on actual free VRAM after transformer load, not only a coarse threshold
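A residency decision along those lines might weigh actual post-load headroom against the expected reload cost. This is a cost-model sketch: the 23.5s default reload cost is the measured q4 GGUF encoder load from the snapshot, while the headroom margin, the 30s/min break-even point, and the request-rate input are all assumptions:

```python
# Illustrative cost model; only the 23.5s reload figure comes from measurements.
def keep_encoder_resident(free_vram_gb, encoder_vram_gb,
                          expected_requests_per_min, reload_cost_s=23.5):
    """Keep the Qwen2.5 encoder on GPU only when reloads would dominate.

    free_vram_gb is free VRAM measured *after* the transformer is loaded,
    so the decision tracks real headroom rather than a coarse threshold.
    """
    if free_vram_gb - encoder_vram_gb < 2.0:  # preserve denoise headroom (assumed margin)
        return False
    # If cache misses are frequent enough, residency beats repeated reloads.
    expected_reload_s_per_min = expected_requests_per_min * reload_cost_s
    return expected_reload_s_per_min > 30.0
```

The same shape works for the CPU-resident variant: swap the VRAM check for a host-memory check and the GPU reload cost for a host-to-device transfer cost.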

5. Add more regression coverage

Add tests for:

  • hot server cache hit path skipping Qwen2.5 encoder reload
  • hot path preserving transformer residency across tiled VAE decode
  • local and hot paths selecting different encoder policies intentionally
  • native-resolution Qwen-2512 not producing black images on CUDA

6. Consider black-image focused diagnostics and regression tests

The original black-image symptom on CUDA for qwen-image-2512 is improved by the math/path fixes, but this still deserves stronger regression coverage.

Add:

  • latent-range / NaN guard assertions during denoise
  • decoded output sanity checks for black or near-black frames
  • one reproducible seeded integration test or smoke harness for CUDA-capable environments
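The first two checks above are cheap enough to sketch directly. This is a framework-free illustration operating on flat float sequences (in practice these would be tensor reductions); the latent-range bound and the near-black threshold are placeholder values:

```python
import math

def latents_healthy(latents):
    """Guard during denoise: reject NaN/Inf or wildly out-of-range latents.

    The 1e3 magnitude bound is a placeholder, not a calibrated limit.
    """
    return all(math.isfinite(v) and abs(v) < 1e3 for v in latents)

def frame_near_black(pixels, threshold=0.02):
    """Sanity check on decoded output in [0, 1]: flag black or near-black frames."""
    return (sum(pixels) / len(pixels)) < threshold
```

Wired into the denoise loop and the post-decode path, these turn the original silent black-image symptom into an immediate, attributable assertion failure.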

Nice-to-have follow-ups

  • Silence or rate-limit noisy server startup warnings around corrupt legacy gallery images.
  • Emit one compact structured summary of selected Qwen execution mode in logs.
  • Store benchmark outputs and timing summaries in a stable repo-root naming convention.
