Summary
The critical Qwen CUDA regressions are fixed on fix/qwen-image-cuda-offload-2512, but there is still meaningful performance work left, especially around the local cold path and more intelligent mode selection.
Landed fixes
- Split Qwen execution behavior by path instead of forcing one compromise across both local sequential and hot server modes.
- Hot server path now keeps the quantized Qwen transformer resident across VAE decode when possible.
- Hot server prompt cache hits no longer reload the Qwen2.5 encoder before denoising.
- Quantized CUDA path disables CFG batching when native-resolution headroom is too tight.
- Server thumbnail warmup is now disabled by default; thumbnails are generated on demand unless MOLD_THUMBNAIL_WARMUP=1 is set.
- Added regression tests for Qwen planner behavior and server thumbnail warmup env gating.
Benchmark snapshot
Test case: qwen-image-2512:q4, 1328x1328, --steps 1
- Local CLI cold: about 110.6s
  - Dominated by BF16 Qwen2.5 text encoder load (83.4s)
  - Denoise about 8.9s
  - VAE tiled GPU decode about 2.4s
- Hot server cold after load: about 13.7s request time once the model is resident
  - Transformer load before first request: 42.1s
  - Resident Qwen2.5 q4 GGUF GPU encoder load during cold request: 23.5s
  - Denoise about 8.9s
  - VAE tiled GPU decode about 2.0s
- Hot server warm request with prompt cache hit: about 13.1s
  - No Qwen2.5 encoder reload on cache hit
  - Denoise about 8.7s
  - VAE tiled GPU decode about 2.3s
Remaining work
1. Fix local cold-start cost for Qwen-2512
The local sequential path still spends most of its time loading the BF16 Qwen2.5 text encoder. We need a local-only policy that is faster without reintroducing the VRAM poisoning and CPU/RAM blowups seen during experiments.
Options to evaluate:
- Quantized Qwen2.5 encoder on CPU with mmap-friendly reuse and better host-memory behavior.
- Transformer-first local planning with a smaller quantized encoder when it actually reduces total wall clock.
- Resolution-aware and device-aware encoder selection instead of a static threshold.
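To make the third option concrete, here is a minimal sketch of what resolution- and device-aware encoder selection could look like. The `EncoderPlan` type, variant names, and every VRAM threshold below are illustrative assumptions, not the current planner's API or tuned values:

```python
# Hypothetical sketch of resolution- and device-aware encoder selection.
# All thresholds here are placeholders, not measured cutoffs.
from dataclasses import dataclass


@dataclass
class EncoderPlan:
    variant: str  # "bf16" or "q4" (illustrative names)
    device: str   # "cuda" or "cpu"


def select_encoder(width: int, height: int, free_vram_gb: float,
                   hot_server: bool) -> EncoderPlan:
    """Pick an encoder per path instead of using one static threshold."""
    native = width * height >= 1328 * 1328  # native Qwen-2512 or larger
    if hot_server:
        # Hot path: a quantized GPU encoder amortizes across requests.
        return EncoderPlan("q4", "cuda" if free_vram_gb >= 4.0 else "cpu")
    # Local sequential path: skip the slow BF16 load when a smaller
    # quantized encoder plausibly reduces total wall clock.
    if native and free_vram_gb < 8.0:
        return EncoderPlan("q4", "cpu")
    return EncoderPlan("bf16", "cuda" if free_vram_gb >= 16.0 else "cpu")
```

The key design point is that the hot-server and local branches are allowed to disagree, which is exactly the per-path split the landed fixes introduced.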
2. Add a real benchmark harness
We need repeatable benchmarks instead of manual shell runs.
Add a benchmark command or script that records:
- cold local time
- cold hot-server load time
- warm hot-server request time
- transformer reload count
- encoder reload count
- VAE mode selected
- output image filename in repo root
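A harness recording those fields could start as small as the sketch below. The phase names, counter keys, and the repo-root filename are assumptions for illustration, not an agreed convention:

```python
# Minimal benchmark-harness sketch; names and the output filename
# convention are placeholders, not the project's real CLI.
import json
import time
from pathlib import Path


def run_benchmark(label: str, fn) -> dict:
    """Time one phase and return a record for the summary."""
    start = time.perf_counter()
    result = fn()
    return {"phase": label,
            "seconds": round(time.perf_counter() - start, 2),
            "result": result}


def write_summary(records: list[dict], counters: dict,
                  repo_root: str = ".") -> Path:
    """Write timings plus reload counters to a stable repo-root filename."""
    out = Path(repo_root) / "bench-qwen-image-2512.json"
    out.write_text(json.dumps({"runs": records, "counters": counters},
                              indent=2))
    return out


if __name__ == "__main__":
    records = [run_benchmark("cold_local", lambda: "ok")]
    counters = {"transformer_reloads": 0, "encoder_reloads": 0,
                "vae_mode": "tiled_gpu"}
    print(write_summary(records, counters))
```

Emitting one JSON file per run makes regressions diffable between branches, which manual shell runs cannot give us.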
3. Make VAE decode selection proactive instead of retry-based
Right now native Qwen-2512 usually hits GPU VAE OOM and then falls back to tiled GPU decode. That works, but it wastes time.
Investigate:
- selecting tiled GPU decode up front for known native-resolution Qwen CUDA cases
- tile-size heuristics based on free VRAM
- whether warm hot-server mode should always prefer tiled decode for this model
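An up-front selection could look like the sketch below. The per-pixel memory constant is a placeholder, not a measured value; the real heuristic would be calibrated against actual decode peaks:

```python
# Sketch of proactive VAE decode-mode selection based on free VRAM,
# replacing the current OOM-then-retry fallback. Constants are illustrative.
def estimate_full_decode_gb(width: int, height: int) -> float:
    """Rough upper bound for a full (untiled) GPU VAE decode, in GiB."""
    # Assumption: decode activations scale with pixel count; the 350
    # bytes-per-pixel constant is a placeholder, not measured.
    return width * height * 350 / (1024 ** 3)


def choose_vae_mode(width: int, height: int, free_vram_gb: float,
                    safety_margin_gb: float = 1.0) -> str:
    """Return "gpu" or "tiled_gpu" without relying on an OOM retry."""
    needed = estimate_full_decode_gb(width, height) + safety_margin_gb
    return "gpu" if free_vram_gb >= needed else "tiled_gpu"
```

Even a conservative estimate is a win here: skipping one doomed full-decode attempt avoids both the wasted kernel launches and the allocator churn the retry path causes today.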
4. Tighten hot-path residency policy
The current hot path is much better, but we should continue improving residency decisions.
Potential work:
- keep the Qwen2.5 encoder resident on GPU for very short repeat intervals when it is cheaper than reload
- optionally keep it CPU-resident when prompt cache miss rate is high
- make the resident text-encoder choice depend on actual free VRAM after transformer load, not only a coarse threshold
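The three bullets above reduce to one decision function. This sketch uses the 23.5s encoder reload figure from the benchmark snapshot as an example input; the function name, the 0.5 miss-rate cutoff, and the three-way outcome are illustrative, not a committed policy:

```python
# Sketch of a residency decision comparing reload cost against keeping
# the encoder resident. Thresholds are placeholders.
def keep_encoder_resident(expected_gap_s: float, reload_cost_s: float,
                          free_vram_after_transformer_gb: float,
                          encoder_size_gb: float,
                          cache_miss_rate: float) -> str:
    """Return where the encoder should live between requests."""
    if (free_vram_after_transformer_gb >= encoder_size_gb
            and expected_gap_s < reload_cost_s):
        return "gpu"    # cheaper to stay resident than to reload
    if cache_miss_rate > 0.5:
        return "cpu"    # frequent misses: avoid repeated disk loads
    return "evict"      # rare use: free the memory entirely
```

Crucially, the first branch checks free VRAM measured after the transformer load, which addresses the "not only a coarse threshold" point directly.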
5. Add more regression coverage
Add tests for:
- hot server cache hit path skipping Qwen2.5 encoder reload
- hot path preserving transformer residency across tiled VAE decode
- local and hot paths selecting different encoder policies intentionally
- native-resolution Qwen-2512 not producing black images on CUDA
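The first of these tests could be written against a small stub before the real planner exposes counters. Everything here (`PlannerStub`, the reload counter) is invented for illustration; the real test would drive the actual hot-server planner:

```python
# Hypothetical pytest-style sketch for the cache-hit regression.
# PlannerStub stands in for the real planner and its reload counter.
class PlannerStub:
    """Tracks encoder reloads so the test can assert on them."""
    def __init__(self):
        self.encoder_reloads = 0
        self._prompt_cache = {}

    def encode(self, prompt: str):
        if prompt not in self._prompt_cache:
            self.encoder_reloads += 1  # real code would load weights here
            self._prompt_cache[prompt] = hash(prompt)
        return self._prompt_cache[prompt]


def test_cache_hit_skips_encoder_reload():
    planner = PlannerStub()
    planner.encode("a red fox")        # cold request: one reload expected
    before = planner.encoder_reloads
    planner.encode("a red fox")        # warm request: prompt cache hit
    assert planner.encoder_reloads == before
```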
6. Consider black-image focused diagnostics and regression tests
The original black-image symptom on CUDA for qwen-image-2512 is improved by the math/path fixes, but this still deserves stronger regression coverage.
Add:
- latent-range / NaN guard assertions during denoise
- decoded output sanity checks for black or near-black frames
- one reproducible seeded integration test or smoke harness for CUDA-capable environments
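The first two checks are cheap enough to sketch now. The thresholds below (brightness cutoff of 8, 0.1% bright-pixel fraction) are illustrative and would need tuning against real decodes:

```python
# Sketch of latent and decoded-output sanity checks for the black-image
# symptom. Thresholds are placeholders, not validated cutoffs.
import math


def latents_finite(latents: list[float]) -> bool:
    """NaN/Inf guard to run between denoise steps."""
    return all(math.isfinite(v) for v in latents)


def frame_is_black(pixels: list[int], threshold: int = 8,
                   max_bright_fraction: float = 0.001) -> bool:
    """Flag frames that are black or near-black (8-bit pixel values)."""
    bright = sum(1 for p in pixels if p > threshold)
    return bright / max(len(pixels), 1) <= max_bright_fraction
```

Wired into the seeded integration test, these two predicates turn the original silent black-image failure into an immediate, attributable assertion.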
Nice-to-have follow-ups
- Silence or rate-limit noisy server startup warnings around corrupt legacy gallery images.
- Emit one compact structured summary of selected Qwen execution mode in logs.
- Store benchmark outputs and timing summaries in a stable repo-root naming convention.