
feat: block-level GPU offloading for all model families #188

@jamesbrink

Description

Summary

Block-level GPU offloading (streaming transformer blocks one at a time between CPU and GPU during denoising) reduces peak VRAM at the cost of 3-5x slower inference. This is currently implemented for FLUX.1 and Qwen-Image only.

All other model families require the full transformer to fit in VRAM during denoising, limiting them to GPUs with enough memory for the entire model.

Current State

| Family | Engine | Transformer Size | Block Offload | Drop-and-Reload |
| --- | --- | --- | --- | --- |
| FLUX.1 | FluxEngine | 7-23 GB | Yes (flux/offload.rs) | T5/CLIP dropped after encoding |
| Qwen-Image | QwenImageEngine | 11-38 GB | Yes (qwen_image/offload.rs) | Qwen2.5-VL dropped after encoding |
| Flux.2 | Flux2Engine | 4-17 GB | No | Qwen3/T5 dropped after encoding |
| Z-Image | ZImageEngine | 3-23 GB | No | Qwen3 dropped after encoding, transformer dropped for VAE |
| SD3.5 | SD3Engine | 2-8 GB | No | Triple encoder dropped after encoding |
| LTX Video | LtxVideoEngine | 3.6 GB | No | T5 dropped after encoding, transformer dropped for VAE |
| SDXL | SDXLEngine | ~5 GB | No | No |
| SD 1.5 | SD15Engine | ~3 GB | No | UNet dropped for VAE decode |
| Wuerstchen | WuerstchenEngine | ~3 GB (Prior + Decoder) | No | Prior + Decoder dropped for VQ-GAN |

Remaining checklist

High priority (large transformers, most benefit)

  • FLUX.1 — flux/offload.rs, double + single blocks (done)
  • Qwen-Image — qwen_image/offload.rs, 60 blocks, CPU-staged GGUF dequant, split-CFG (done)
  • Z-Image — flow-matching transformer, up to 23 GB BF16. Similar architecture to FLUX.
  • Flux.2 — shared-modulation transformer, partially reusable from flux/offload.rs

Medium priority

  • SD3.5 Large — 8 GB MMDiT. Would help 8 GB GPUs.
  • LTX Video — 3.6 GB transformer but high VRAM pressure from 3D latent volume

Lower priority (small transformers, already fit on most GPUs)

  • SDXL — ~5 GB UNet
  • SD 1.5 — ~3 GB UNet
  • Wuerstchen — ~3 GB combined Prior + Decoder

Implementation Approach

Extract shared offload infrastructure

The current flux/offload.rs and qwen_image/offload.rs are self-contained with family-specific block types. To support remaining families, extract a shared pattern:

  1. Trait-based block offloading — define a trait that each engine's transformer blocks can implement:

    pub trait OffloadableBlock: Send {
        /// Move this block's weights to the given device (CPU or GPU).
        fn to_device(&mut self, device: &Device) -> Result<()>;
        /// Run the block's forward pass on its currently resident device.
        fn forward(&self, input: &BlockInput) -> Result<BlockOutput>;
        /// Size of this block's weights, for VRAM budgeting and progress reporting.
        fn weight_bytes(&self) -> usize;
    }
  2. Generic offload runner — shared CPU↔GPU streaming loop with progress reporting and memory management.

  3. Per-engine block implementations — each family implements OffloadableBlock for its specific block type.
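The three pieces above can be sketched together in a minimal, standalone form. Note the simplifications: `Device`, `Activations`, and `BiasBlock` are hypothetical stand-ins for illustration only — a real engine would use the framework's tensor and device types, the `BlockInput`/`BlockOutput` types from the trait, and asynchronous transfers rather than this synchronous loop.

```rust
// Minimal sketch of the generic offload runner over the OffloadableBlock
// trait. Only one block's weights reside on the GPU at any time.

#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Device {
    Cpu,
    Gpu,
}

pub type Result<T> = std::result::Result<T, String>;

// Plain Vec<f32> stands in for the real activation tensor type.
pub struct Activations(pub Vec<f32>);

pub trait OffloadableBlock: Send {
    fn to_device(&mut self, device: &Device) -> Result<()>;
    fn forward(&self, input: &Activations) -> Result<Activations>;
    fn weight_bytes(&self) -> usize;
}

/// Stream each block to the GPU, run it, and evict it before loading the
/// next one, so peak VRAM is bounded by the largest single block.
pub fn run_offloaded(
    blocks: &mut [Box<dyn OffloadableBlock>],
    mut acts: Activations,
) -> Result<Activations> {
    for block in blocks.iter_mut() {
        block.to_device(&Device::Gpu)?; // upload weights
        acts = block.forward(&acts)?;   // compute on GPU
        block.to_device(&Device::Cpu)?; // evict to free VRAM
    }
    Ok(acts)
}

// A toy per-engine block (adds a bias) showing the implementation side.
pub struct BiasBlock {
    pub bias: f32,
    pub device: Device,
}

impl OffloadableBlock for BiasBlock {
    fn to_device(&mut self, device: &Device) -> Result<()> {
        self.device = *device;
        Ok(())
    }
    fn forward(&self, input: &Activations) -> Result<Activations> {
        if self.device != Device::Gpu {
            return Err("block not resident on GPU".into());
        }
        Ok(Activations(input.0.iter().map(|x| x + self.bias).collect()))
    }
    fn weight_bytes(&self) -> usize {
        std::mem::size_of::<f32>()
    }
}
```

The guard in `forward` makes residency bugs fail loudly; in the real engines the tensor framework would raise a device-mismatch error instead.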

Architecture-specific considerations

  • UNet-based models (SD1.5, SDXL): Offload at residual block level. Skip connections need careful handling.
  • MMDiT-based models (SD3.5): Uniform blocks, straightforward streaming similar to FLUX.
  • Flow-matching transformers (Z-Image, Flux.2, LTX Video): Most similar to existing FLUX/Qwen-Image offload.
  • Wuerstchen: Two-stage cascade — offload each stage's blocks independently.
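For the UNet case, the main wrinkle is that skip tensors produced on the down path must outlive the streamed blocks that produced them, bridging the down and up halves while individual blocks come and go from VRAM. A toy sketch under heavy simplification: plain `Vec<f32>` stands in for tensors, function pointers stand in for streamed residual blocks, and the skip is added rather than concatenated as real UNets do.

```rust
// Hypothetical sketch of UNet block streaming with a skip stack.

pub struct Tensor(pub Vec<f32>);

// Element-wise add standing in for the real skip merge (concatenation).
fn merge(a: &Tensor, b: &Tensor) -> Tensor {
    Tensor(a.0.iter().zip(&b.0).map(|(x, y)| x + y).collect())
}

/// Run down blocks then up blocks one at a time; the skip stack persists
/// across block swaps, so it must be budgeted separately from block weights.
pub fn unet_offloaded(
    down: &[fn(&Tensor) -> Tensor],
    up: &[fn(&Tensor) -> Tensor],
    mut h: Tensor,
) -> Tensor {
    let mut skips: Vec<Tensor> = Vec::new();
    for block in down {
        h = block(&h);                   // (block streamed to GPU here)
        skips.push(Tensor(h.0.clone())); // stash skip for the up path
    }
    for block in up {
        let skip = skips.pop().expect("down/up block counts must match");
        h = block(&merge(&h, &skip));
    }
    h
}
```

Whether the skip stack should live in VRAM or be staged to host memory is itself a trade-off: skips are activations, so they scale with resolution, unlike the fixed-size block weights.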
