perf: parallelize GGUF block dequantization with rayon #121

@jamesbrink

Description

Problem

GGUF quantized transformer loading is slow (~64s for Qwen-Image Q4 on i9-13900K). The bottleneck is sequential per-block dequantization in QuantizedQwenImageTransformer2DModel::new():

for i in 0..cfg.num_layers {  // 60 blocks, one at a time
    blocks.push(QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), dtype, cpu_device)?);
}

Each block calls dequant_tensor(), which boils down to vb.get_no_shape(name)?.dequantize(device)?.to_dtype(dtype)? and expands Q4 → F32 → BF16 on the CPU. With 60 independent blocks processed sequentially, only one CPU core is utilized.
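As a first step, the loop can be rewritten as map + collect over a fallible constructor, which is exactly the shape rayon's into_par_iter slots into. A std-only sketch, where make_block is a hypothetical stand-in for QwenImageTransformerBlock::new:

```rust
// Hypothetical stand-in for QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), ...).
fn make_block(i: usize) -> Result<String, String> {
    Ok(format!("block-{i}"))
}

fn main() -> Result<(), String> {
    // Same shape as the sequential loop, expressed as map + collect;
    // switching `(0..60)` to a rayon `into_par_iter()` is then a one-line change.
    let blocks: Vec<String> = (0..60).map(make_block).collect::<Result<_, _>>()?;
    assert_eq!(blocks.len(), 60);
    assert_eq!(blocks[0], "block-0");
    Ok(())
}
```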

Proposed Fix

Use rayon to parallelize block dequantization across CPU cores. The blocks are independent — no shared mutable state between them. On an i9-13900K (24 threads), this could reduce load time from ~64s to ~10-15s.

use rayon::prelude::*;

// Each closure gets its own per-index view via pp(i); the shared
// VarBuilder must be Sync for this to compile.
let blocks: Vec<QwenImageTransformerBlock> = (0..cfg.num_layers)
    .into_par_iter()
    .map(|i| QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), dtype, cpu_device))
    .collect::<Result<Vec<_>>>()?;

Scope

  • crates/mold-inference/src/qwen_image/quantized_transformer.rs — parallelize block construction
  • Same pattern applies to any other GGUF pipeline with block-level dequantization (Z-Image quantized transformer uses a similar loop)
  • rayon is already available as a transitive dependency via candle
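If a pipeline would rather not depend on rayon's global pool, the same fork-join pattern can be sketched with scoped std threads. par_build below is a hypothetical helper (not in the codebase); make stands in for the block constructor, and the first error from any worker is propagated:

```rust
use std::thread;

// Hypothetical reusable helper: construct `n` independent items on up to
// `workers` threads and propagate the first error encountered.
fn par_build<T, E, F>(n: usize, workers: usize, make: F) -> Result<Vec<T>, E>
where
    T: Send,
    E: Send,
    F: Fn(usize) -> Result<T, E> + Sync,
{
    let mut slots: Vec<Option<Result<T, E>>> = (0..n).map(|_| None).collect();
    let chunk = n.div_ceil(workers.max(1));
    thread::scope(|s| {
        // Hand each worker a disjoint &mut slice of the output, so no locking
        // is needed; block i lands in slot i regardless of completion order.
        for (w, part) in slots.chunks_mut(chunk).enumerate() {
            let make = &make;
            s.spawn(move || {
                for (j, slot) in part.iter_mut().enumerate() {
                    *slot = Some(make(w * chunk + j));
                }
            });
        }
    });
    // Collecting Result items into Result<Vec, _> short-circuits on an Err.
    slots.into_iter().map(Option::unwrap).collect()
}

fn main() {
    // Toy usage: "build" 60 blocks on 8 threads.
    let blocks = par_build(60, 8, |i| Ok::<_, String>(i * 2)).unwrap();
    assert_eq!(blocks.len(), 60);
    assert_eq!(blocks[59], 118);
    println!("built {} blocks", blocks.len());
}
```

rayon's work-stealing pool does the same chunking adaptively, so this sketch is only an illustration of why the pattern is safe: each worker writes a disjoint region of the output and shares nothing mutable.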

Environment

  • CPU: i9-13900K (24 threads)
  • Model: qwen-image:q4 (60 transformer blocks, ~10GB GGUF → ~40GB BF16 dequantized)
  • Current load time: ~64s (release build)
  • Expected: ~10-15s with parallel dequantization
