## Summary
Block-level GPU offloading (streaming transformer blocks one at a time between CPU and GPU during denoising) reduces peak VRAM at the cost of 3-5x slower inference. This is currently implemented for FLUX.1 and Qwen-Image only.
All other model families require the full transformer to fit in VRAM during denoising, limiting them to GPUs with enough memory for the entire model.
## Current State
| Family | Engine | Transformer Size | Block Offload | Drop-and-Reload |
|---|---|---|---|---|
| FLUX.1 | FluxEngine | 7-23 GB | Yes (`flux/offload.rs`) | T5/CLIP dropped after encoding |
| Qwen-Image | QwenImageEngine | 11-38 GB | Yes (`qwen_image/offload.rs`) | Qwen2.5-VL dropped after encoding |
| Flux.2 | Flux2Engine | 4-17 GB | No | Qwen3/T5 dropped after encoding |
| Z-Image | ZImageEngine | 3-23 GB | No | Qwen3 dropped after encoding; transformer dropped for VAE |
| SD3.5 | SD3Engine | 2-8 GB | No | Triple encoder dropped after encoding |
| LTX Video | LtxVideoEngine | 3.6 GB | No | T5 dropped after encoding; transformer dropped for VAE |
| SDXL | SDXLEngine | ~5 GB | No | No |
| SD 1.5 | SD15Engine | ~3 GB | No | UNet dropped for VAE decode |
| Wuerstchen | WuerstchenEngine | ~3 GB (Prior + Decoder) | No | Prior + Decoder dropped for VQ-GAN |
## Remaining checklist

### High priority (large transformers, most benefit)

- [x] FLUX.1 — done (`flux/offload.rs`, double + single blocks)
- [x] Qwen-Image — done (`qwen_image/offload.rs`, 60 blocks, CPU-staged GGUF dequant, split-CFG)

### Medium priority

### Lower priority (small transformers, already fit on most GPUs)
## Implementation Approach
### Extract shared offload infrastructure
The current `flux/offload.rs` and `qwen_image/offload.rs` are self-contained with family-specific block types. To support the remaining families, extract a shared pattern:

1. **Trait-based block offloading** — define a trait that each engine's transformer blocks can implement:

   ```rust
   pub trait OffloadableBlock: Send {
       /// Move this block's weights to the given device (CPU or GPU).
       fn to_device(&mut self, device: &Device) -> Result<()>;
       /// Run this block's forward pass.
       fn forward(&self, input: &BlockInput) -> Result<BlockOutput>;
       /// Size of this block's weights in bytes, for VRAM budgeting.
       fn weight_bytes(&self) -> usize;
   }
   ```

2. **Generic offload runner** — shared CPU↔GPU streaming loop with progress reporting and memory management.

3. **Per-engine block implementations** — each family implements `OffloadableBlock` for its specific block type.
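A minimal sketch of what the generic runner could look like. `Device`, `BlockInput`, and `BlockOutput` are stubbed out here so the example is self-contained (in the real crate they would be the tensor library's types), and the streaming policy is deliberately simplified to upload-run-evict per block:

```rust
// Hypothetical sketch of a generic block-offload runner. Device, BlockInput,
// and BlockOutput are stand-ins so the example compiles on its own; the
// names and exact shape are assumptions, not the crate's actual API.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Device {
    Cpu,
    Gpu,
}

pub type Result<T> = std::result::Result<T, String>;
pub struct BlockInput(pub f32);
pub struct BlockOutput(pub f32);

pub trait OffloadableBlock: Send {
    fn to_device(&mut self, device: &Device) -> Result<()>;
    fn forward(&self, input: &BlockInput) -> Result<BlockOutput>;
    fn weight_bytes(&self) -> usize;
}

/// Stream blocks through the GPU one at a time: upload, run, evict.
pub fn run_offloaded(
    blocks: &mut [Box<dyn OffloadableBlock>],
    mut hidden: BlockInput,
) -> Result<BlockOutput> {
    let mut out = BlockOutput(hidden.0);
    for (i, block) in blocks.iter_mut().enumerate() {
        block.to_device(&Device::Gpu)?; // upload this block's weights
        out = block.forward(&hidden)?; // run the block on GPU
        hidden = BlockInput(out.0); // thread the output to the next block
        block.to_device(&Device::Cpu)?; // evict before the next upload
        eprintln!("block {}: {} bytes streamed", i, block.weight_bytes());
    }
    Ok(out)
}
```

A real runner would additionally pin CPU buffers, overlap the next upload with the current forward pass, and report progress, but the upload-run-evict loop above is the core of the pattern.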
### Architecture-specific considerations
- UNet-based models (SD1.5, SDXL): Offload at residual block level. Skip connections need careful handling.
- MMDiT-based models (SD3.5): Uniform blocks, straightforward streaming similar to FLUX.
- Flow-matching transformers (Z-Image, Flux.2, LTX Video): Most similar to existing FLUX/Qwen-Image offload.
- Wuerstchen: Two-stage cascade — offload each stage's blocks independently.
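For the UNet case, the complication is that each down-block's output feeds the mirrored up-block, so a streaming runner must keep those activations alive even while the block weights themselves are evicted. A minimal illustration of the stash-and-pop pattern (all names are hypothetical, and plain `Vec<f32>` stands in for tensors):

```rust
// Hypothetical illustration: skip activations are stashed during the down
// pass and consumed in reverse order during the up pass, independent of
// where the block weights currently live.
type Tensor = Vec<f32>;

/// Run a mirrored down/up pass, threading skip connections through a stack.
/// `down` and `up` are per-block functions standing in for offloaded blocks.
fn unet_pass(
    mut x: Tensor,
    down: &[fn(&Tensor) -> Tensor],
    up: &[fn(&Tensor, &Tensor) -> Tensor],
) -> Tensor {
    let mut skips: Vec<Tensor> = Vec::new();
    for block in down {
        x = block(&x);
        skips.push(x.clone()); // stash for the mirrored up block
    }
    for block in up {
        let skip = skips.pop().expect("mirrored skip"); // reverse order
        x = block(&x, &skip);
    }
    x
}
```

In an offloading runner, the stashed skips could themselves be moved to CPU and reloaded when the matching up-block runs, trading extra transfers for lower peak VRAM.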
## References
- `crates/mold-inference/src/flux/offload.rs` — FLUX.1 block offload
- `crates/mold-inference/src/qwen_image/offload.rs` — Qwen-Image block offload
- `crates/mold-inference/src/device.rs` — VRAM detection and device placement