Problem
GGUF quantized transformer loading is slow (~64s for Qwen-Image Q4 on i9-13900K). The bottleneck is sequential per-block dequantization in QuantizedQwenImageTransformer2DModel::new():
```rust
for i in 0..cfg.num_layers { // 60 blocks, one at a time
    blocks.push(QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), dtype, cpu_device)?);
}
```
Each block calls `dequant_tensor()` → `vb.get_no_shape(name)?.dequantize(device)?.to_dtype(dtype)?`, which expands Q4 → F32 → BF16 on CPU. With 60 independent blocks processed sequentially, only one CPU core is utilized.
Proposed Fix
Use rayon to parallelize block dequantization across CPU cores. The blocks are independent: no shared mutable state between them, so each can be constructed on its own worker. (This assumes the quantized VarBuilder handle is `Send + Sync`, or cheaply cloneable per worker, so it can be used from rayon threads.) On an i9-13900K (24 threads), this could reduce load time from ~64s to ~10-15s.
```rust
use rayon::prelude::*;

let blocks: Vec<QwenImageTransformerBlock> = (0..cfg.num_layers)
    .into_par_iter()
    .map(|i| QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), dtype, cpu_device))
    .collect::<Result<Vec<_>>>()?;
```
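The construct-in-parallel, collect-in-order pattern above can be sketched with only the standard library (scoped threads instead of rayon's work-stealing pool). `Block` and `build_block` below are hypothetical stand-ins for `QwenImageTransformerBlock` and its dequantizing constructor; the point is that fallible per-index construction parallelizes cleanly while preserving block order:

```rust
use std::thread;

// Hypothetical stand-in for QwenImageTransformerBlock: construction is the
// expensive, independent, per-index step we want to run in parallel.
#[derive(Debug, PartialEq)]
struct Block {
    index: usize,
}

fn build_block(i: usize) -> Result<Block, String> {
    // Real code would dequantize this block's tensors here.
    Ok(Block { index: i })
}

// Spawn one scoped thread per block, then join in index order, so the
// resulting Vec is ordered exactly like the sequential loop's output.
// rayon's into_par_iter().collect::<Result<Vec<_>>>() gives the same
// ordering guarantee with a bounded thread pool.
fn build_all(n: usize) -> Result<Vec<Block>, String> {
    thread::scope(|s| {
        let handles: Vec<_> = (0..n).map(|i| s.spawn(move || build_block(i))).collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let blocks = build_all(60).unwrap();
    assert_eq!(blocks.len(), 60);
    assert!(blocks.iter().enumerate().all(|(i, b)| b.index == i));
    println!("built {} blocks in order", blocks.len());
}
```

Note that this toy version spawns one OS thread per block; rayon is preferable in the real loader because it caps concurrency at the pool size (24 workers here) instead of 60 threads.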
Scope
- `crates/mold-inference/src/qwen_image/quantized_transformer.rs` — parallelize block construction
- Same pattern applies to any other GGUF pipeline with block-level dequantization (the Z-Image quantized transformer uses a similar loop)
- rayon is already available as a transitive dependency via candle
Environment
- CPU: i9-13900K (24 threads)
- Model: qwen-image:q4 (60 transformer blocks, ~10GB GGUF → ~40GB BF16 dequantized)
- Current load time: ~64s (release build)
- Expected: ~10-15s with parallel dequantization
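A back-of-envelope check of the ~10-15s estimate, using only the numbers above (assuming uniformly expensive blocks and no memory-bandwidth contention, which is why the real expectation is well above the ideal lower bound):

```rust
fn main() {
    let total_secs = 64.0_f64; // measured sequential load time
    let blocks = 60.0_f64;     // transformer blocks
    let threads = 24.0_f64;    // i9-13900K hardware threads

    // ~1.07s per block if the 64s is spent evenly across blocks.
    let per_block = total_secs / blocks;

    // 60 blocks over 24 workers = 3 "waves" (24 + 24 + 12).
    let waves = (blocks / threads).ceil();

    // Ideal lower bound: ~3.2s. Dequantization writes ~40GB of BF16,
    // so memory bandwidth contention should push this toward 10-15s.
    let ideal = waves * per_block;

    println!("per-block: {per_block:.2}s, waves: {waves}, ideal: {ideal:.2}s");
}
```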