## Summary
Block-level GPU offloading (streaming transformer blocks one at a time between CPU and GPU during denoising) reduces peak VRAM at the cost of 3-5x slower inference. This is currently implemented for FLUX.1 and Qwen-Image only.
All other model families require the full transformer to fit in VRAM during denoising, limiting them to GPUs with enough memory for the entire model.
## Current State
| Family | Engine | Transformer Size | Block Offload | Drop-and-Reload |
|---|---|---|---|---|
| FLUX.1 | FluxEngine | 7-23 GB | Yes (`flux/offload.rs`) | T5/CLIP dropped after encoding |
| Qwen-Image | QwenImageEngine | 11-38 GB | Yes (`qwen_image/offload.rs`) | Qwen2.5-VL dropped after encoding |
| Flux.2 | Flux2Engine | 4-17 GB | No | Qwen3/T5 dropped after encoding |
| Z-Image | ZImageEngine | 3-23 GB | No | Qwen3 dropped after encoding; transformer dropped for VAE |
| SD3.5 | SD3Engine | 2-8 GB | No | Triple encoder dropped after encoding |
| LTX Video | LtxVideoEngine | 3.6 GB | No | T5 dropped after encoding; transformer dropped for VAE |
| SDXL | SDXLEngine | ~5 GB | No | No |
| SD 1.5 | SD15Engine | ~3 GB | No | UNet dropped for VAE decode |
| Wuerstchen | WuerstchenEngine | ~3 GB (Prior + Decoder) | No | Prior + Decoder dropped for VQ-GAN |
## Remaining checklist

### High priority (large transformers, most benefit)

- [x] FLUX.1 — done (`flux/offload.rs`, double + single blocks)
- [x] Qwen-Image — done (`qwen_image/offload.rs`, 60 blocks, CPU-staged GGUF dequant, split-CFG)

### Medium priority

### Lower priority (small transformers, already fit on most GPUs)
## Implementation Approach
### Extract shared offload infrastructure
The current `flux/offload.rs` and `qwen_image/offload.rs` are self-contained with family-specific block types. To support the remaining families, extract a shared pattern:

1. **Trait-based block offloading** — define a trait that each engine's transformer blocks can implement:

   ```rust
   pub trait OffloadableBlock: Send {
       /// Move this block's weights to the given device (CPU or GPU).
       fn to_device(&mut self, device: &Device) -> Result<()>;
       /// Run this block's forward pass.
       fn forward(&self, input: &BlockInput) -> Result<BlockOutput>;
       /// Size of this block's weights in bytes, for VRAM budgeting.
       fn weight_bytes(&self) -> usize;
   }
   ```

2. **Generic offload runner** — shared CPU↔GPU streaming loop with progress reporting and memory management.

3. **Per-engine block implementations** — each family implements `OffloadableBlock` for its specific block type.
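A minimal sketch of what the generic runner could look like. `Device`, `BlockInput`, and `BlockOutput` are stubbed out here so the example is self-contained (in the real crate they would be the tensor library's types), and the streaming policy is deliberately simplified to upload-run-evict per block:

```rust
// Hypothetical sketch of a generic block-offload runner. Device, BlockInput,
// and BlockOutput are stand-ins so the example compiles on its own; the
// names and exact shape are assumptions, not the crate's actual API.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Device {
    Cpu,
    Gpu,
}

pub type Result<T> = std::result::Result<T, String>;
pub struct BlockInput(pub f32);
pub struct BlockOutput(pub f32);

pub trait OffloadableBlock: Send {
    fn to_device(&mut self, device: &Device) -> Result<()>;
    fn forward(&self, input: &BlockInput) -> Result<BlockOutput>;
    fn weight_bytes(&self) -> usize;
}

/// Stream blocks through the GPU one at a time: upload, run, evict.
pub fn run_offloaded(
    blocks: &mut [Box<dyn OffloadableBlock>],
    mut hidden: BlockInput,
) -> Result<BlockOutput> {
    let mut out = BlockOutput(hidden.0);
    for (i, block) in blocks.iter_mut().enumerate() {
        block.to_device(&Device::Gpu)?; // upload this block's weights
        out = block.forward(&hidden)?; // run the block on GPU
        hidden = BlockInput(out.0); // thread the output to the next block
        block.to_device(&Device::Cpu)?; // evict before the next upload
        eprintln!("block {}: {} bytes streamed", i, block.weight_bytes());
    }
    Ok(out)
}
```

A real runner would additionally pin CPU buffers, overlap the next upload with the current forward pass, and report progress, but the upload-run-evict loop above is the core of the pattern.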
### Architecture-specific considerations
- UNet-based models (SD1.5, SDXL): Offload at residual block level. Skip connections need careful handling.
- MMDiT-based models (SD3.5): Uniform blocks, straightforward streaming similar to FLUX.
- Flow-matching transformers (Z-Image, Flux.2, LTX Video): Most similar to existing FLUX/Qwen-Image offload.
- Wuerstchen: Two-stage cascade — offload each stage's blocks independently.
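For the UNet case, the complication is that each down-block's output feeds the mirrored up-block, so a streaming runner must keep those activations alive even while the block weights themselves are evicted. A minimal illustration of the stash-and-pop pattern (all names are hypothetical, and plain `Vec<f32>` stands in for tensors):

```rust
// Hypothetical illustration: skip activations are stashed during the down
// pass and consumed in reverse order during the up pass, independent of
// where the block weights currently live.
type Tensor = Vec<f32>;

/// Run a mirrored down/up pass, threading skip connections through a stack.
/// `down` and `up` are per-block functions standing in for offloaded blocks.
fn unet_pass(
    mut x: Tensor,
    down: &[fn(&Tensor) -> Tensor],
    up: &[fn(&Tensor, &Tensor) -> Tensor],
) -> Tensor {
    let mut skips: Vec<Tensor> = Vec::new();
    for block in down {
        x = block(&x);
        skips.push(x.clone()); // stash for the mirrored up block
    }
    for block in up {
        let skip = skips.pop().expect("mirrored skip"); // reverse order
        x = block(&x, &skip);
    }
    x
}
```

In an offloading runner, the stashed skips could themselves be moved to CPU and reloaded when the matching up-block runs, trading extra transfers for lower peak VRAM.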
## References
- `crates/mold-inference/src/flux/offload.rs` — FLUX.1 block offload
- `crates/mold-inference/src/qwen_image/offload.rs` — Qwen-Image block offload
- `crates/mold-inference/src/device.rs` — VRAM detection and device placement