perf: parallelize GGUF block dequantization with rayon #121

@jamesbrink

Description

Problem

GGUF quantized transformer loading is slow (~64s for Qwen-Image Q4 on i9-13900K). The bottleneck is sequential per-block dequantization in QuantizedQwenImageTransformer2DModel::new():

for i in 0..cfg.num_layers {  // 60 blocks, one at a time
    blocks.push(QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), dtype, cpu_device)?);
}

Each block calls dequant_tensor(), which boils down to vb.get_no_shape(name)?.dequantize(device)?.to_dtype(dtype)? and expands Q4 → F32 → BF16 on the CPU. With 60 independent blocks processed sequentially, only one CPU core is utilized.
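As a first step, the loop can be rewritten as map + collect over a fallible constructor, which is exactly the shape rayon's into_par_iter slots into. A std-only sketch, where make_block is a hypothetical stand-in for QwenImageTransformerBlock::new:

```rust
// Hypothetical stand-in for QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), ...).
fn make_block(i: usize) -> Result<String, String> {
    Ok(format!("block-{i}"))
}

fn main() -> Result<(), String> {
    // Same shape as the sequential loop, expressed as map + collect;
    // switching `(0..60)` to a rayon `into_par_iter()` is then a one-line change.
    let blocks: Vec<String> = (0..60).map(make_block).collect::<Result<_, _>>()?;
    assert_eq!(blocks.len(), 60);
    assert_eq!(blocks[0], "block-0");
    Ok(())
}
```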

Proposed Fix

Use rayon to parallelize block dequantization across CPU cores. The blocks are independent — no shared mutable state between them. On an i9-13900K (24 threads), this could reduce load time from ~64s to ~10-15s.

use rayon::prelude::*;

// Each closure gets its own per-index view via pp(i); the shared
// VarBuilder must be Sync for this to compile.
let blocks: Vec<QwenImageTransformerBlock> = (0..cfg.num_layers)
    .into_par_iter()
    .map(|i| QwenImageTransformerBlock::new(cfg, vb_blocks.pp(i), dtype, cpu_device))
    .collect::<Result<Vec<_>>>()?;

Scope

  • crates/mold-inference/src/qwen_image/quantized_transformer.rs — parallelize block construction
  • Same pattern applies to any other GGUF pipeline with block-level dequantization (Z-Image quantized transformer uses a similar loop)
  • rayon is already available as a transitive dependency via candle
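If a pipeline would rather not depend on rayon's global pool, the same fork-join pattern can be sketched with scoped std threads. par_build below is a hypothetical helper (not in the codebase); make stands in for the block constructor, and the first error from any worker is propagated:

```rust
use std::thread;

// Hypothetical reusable helper: construct `n` independent items on up to
// `workers` threads and propagate the first error encountered.
fn par_build<T, E, F>(n: usize, workers: usize, make: F) -> Result<Vec<T>, E>
where
    T: Send,
    E: Send,
    F: Fn(usize) -> Result<T, E> + Sync,
{
    let mut slots: Vec<Option<Result<T, E>>> = (0..n).map(|_| None).collect();
    let chunk = n.div_ceil(workers.max(1));
    thread::scope(|s| {
        // Hand each worker a disjoint &mut slice of the output, so no locking
        // is needed; block i lands in slot i regardless of completion order.
        for (w, part) in slots.chunks_mut(chunk).enumerate() {
            let make = &make;
            s.spawn(move || {
                for (j, slot) in part.iter_mut().enumerate() {
                    *slot = Some(make(w * chunk + j));
                }
            });
        }
    });
    // Collecting Result items into Result<Vec, _> short-circuits on an Err.
    slots.into_iter().map(Option::unwrap).collect()
}

fn main() {
    // Toy usage: "build" 60 blocks on 8 threads.
    let blocks = par_build(60, 8, |i| Ok::<_, String>(i * 2)).unwrap();
    assert_eq!(blocks.len(), 60);
    assert_eq!(blocks[59], 118);
    println!("built {} blocks", blocks.len());
}
```

rayon's work-stealing pool does the same chunking adaptively, so this sketch is only an illustration of why the pattern is safe: each worker writes a disjoint region of the output and shares nothing mutable.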

Environment

  • CPU: i9-13900K (24 threads)
  • Model: qwen-image:q4 (60 transformer blocks, ~10GB GGUF → ~40GB BF16 dequantized)
  • Current load time: ~64s (release build)
  • Expected: ~10-15s with parallel dequantization
