
feat: Multi-GPU support for pipeline and tensor parallelism #209

@jamesbrink

Summary

Mold is currently strictly single-GPU — every device selection is hardcoded to GPU index 0. On multi-GPU hosts (e.g. dual L40S), only one GPU is utilized regardless of available hardware. Large models like Qwen-Image (60 transformer blocks, ~38GB BF16) and FLUX-dev (~23GB BF16) fall back to CPU↔GPU block offloading even when a second GPU sits idle, resulting in 3-5x slower inference.

Current Architecture

Hardcoded Device 0

In crates/mold-inference/src/device.rs, create_device():

Ok(Device::new_cuda(0)?)  // always GPU 0

All VRAM queries (free_vram_bytes(), vram_used_estimate()) also target only GPU 0's context. There is already a TODO comment in reclaim_gpu_memory() acknowledging this:

"NOTE: This assumes a single-GPU setup — consistent with create_device which also hardcodes device 0. If multi-GPU support is added, this should iterate over all device indices that held engine allocations."

Single-GPU Assumptions Across Engines

| Engine | Text Encoder | Transformer | VAE |
|---|---|---|---|
| FLUX | Dynamic GPU/CPU | Offloaded blocks (CPU→GPU:0) | GPU:0 |
| Qwen-Image | Dynamic GPU/CPU + staging | Dynamic block placement (GPU:0 + CPU) | Dynamic GPU/CPU |
| Z-Image | VRAM-based variant | Drop-and-reload on GPU:0 | Dynamic |
| SD3 | Triple encoder on GPU:0/CPU | GPU:0 | GPU:0 |
| LTX-Video | T5 on GPU:0/CPU | GPU:0 | GPU:0 |

Every engine follows the same pattern: measure VRAM on GPU 0, decide what fits, offload the rest to CPU. A second GPU is never considered.

Proposed Multi-GPU Strategies

Phase 1: Device Selection & Enumeration

  • Add --gpu / MOLD_GPU to select a specific GPU index (not everyone wants device 0)
  • Add mold info --gpus to enumerate available GPUs with VRAM
  • Per-device VRAM queries: free_vram_on_device(ordinal) instead of only current context
  • Config support: gpu = 1 in config.toml
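A minimal sketch of the first and third bullets, assuming mold keeps creating devices through candle's Device::new_cuda and talks to the driver via cudarc. The gpu_index parameter is a hypothetical addition, and the exact cudarc calls (CudaDevice::new, result::mem_get_info) may differ depending on the cudarc version mold pins:

```rust
use candle_core::Device;

/// Sketch: accept an explicit ordinal (from --gpu / MOLD_GPU / `gpu = 1` in
/// config.toml) instead of hardcoding index 0 as today.
pub fn create_device(gpu_index: usize) -> candle_core::Result<Device> {
    Device::new_cuda(gpu_index)
}

/// Proposed free_vram_on_device(ordinal): query a specific GPU's pool instead
/// of whichever context happens to be current.
#[cfg(feature = "cuda")]
pub fn free_vram_on_device(
    ordinal: usize,
) -> Result<(usize, usize), cudarc::driver::DriverError> {
    // CudaDevice::new returns cudarc's cached handle for this ordinal and
    // binds its primary context to the calling thread, so mem_get_info
    // reports that device's (free, total) bytes rather than GPU 0's.
    let _dev = cudarc::driver::CudaDevice::new(ordinal)?;
    cudarc::driver::result::mem_get_info()
}
```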

Phase 2: Pipeline Parallelism (Component Placement)

Place different pipeline stages on different GPUs:

GPU:0 — Text encoder(s)
GPU:1 — Transformer + VAE

This is the lowest-hanging fruit and matches how the code already separates encoder vs. transformer lifecycle (drop-and-reload pattern). Implementation:

  • Extend create_device() to accept a device index
  • Add per-component device config: --encoder-gpu 0 --transformer-gpu 1
  • Tensor transfers between devices via tensor.to_device()
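A minimal sketch of that split, assuming candle tensors and Tensor::to_device for the single cross-device hop; DevicePlan and to_transformer_device are hypothetical names, not existing mold types:

```rust
use candle_core::{Device, Result, Tensor};

/// Hypothetical per-component placement, driven by e.g.
/// --encoder-gpu 0 --transformer-gpu 1.
pub struct DevicePlan {
    pub encoder: Device,
    pub transformer: Device, // the VAE would share this device in the split above
}

/// Text encoding runs on plan.encoder; the embeddings cross devices exactly
/// once before denoising begins on plan.transformer.
pub fn to_transformer_device(plan: &DevicePlan, prompt_embeds: Tensor) -> Result<Tensor> {
    prompt_embeds.to_device(&plan.transformer)
}
```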

Phase 3: Block Distribution (Model Parallelism)

Distribute transformer blocks across GPUs instead of offloading to CPU:

Current (single GPU):
  Blocks[0..N] on GPU:0, Blocks[N..60] on CPU (streamed 1-at-a-time)

Multi-GPU:
  Blocks[0..30] on GPU:0, Blocks[30..60] on GPU:1 (no CPU roundtrip)

This directly replaces the CPU offload path with GPU-to-GPU transfer, which should be dramatically faster (NVLink/PCIe GPU↔GPU vs. system RAM roundtrip). The existing BlockResidency enum in Qwen-Image's offloader and the block-streaming architecture in FLUX's offloader are natural extension points.
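A minimal sketch of the placement and the single cross-GPU hop, using candle's Device/Tensor API; plan_block_devices and hop_if_needed are hypothetical helpers, not the existing offloader types:

```rust
use candle_core::{Device, Result, Tensor};

/// Assign blocks [0, split) to gpu0 and [split, n) to gpu1, replacing the
/// current path where the tail blocks live in host RAM.
pub fn plan_block_devices(n: usize, split: usize, gpu0: &Device, gpu1: &Device) -> Vec<Device> {
    (0..n)
        .map(|i| if i < split { gpu0.clone() } else { gpu1.clone() })
        .collect()
}

/// Hidden states hop devices only once per forward pass, at the split point,
/// instead of a CPU roundtrip per streamed block.
pub fn hop_if_needed(hidden: Tensor, next: &Device) -> Result<Tensor> {
    if hidden.device().same_device(next) {
        Ok(hidden)
    } else {
        hidden.to_device(next)
    }
}
```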

Phase 4: Tensor Parallelism (Weight Sharding)

Shard individual layer weights across GPUs (e.g., split attention heads). This is the most complex strategy and likely only worthwhile for very large models. Lower priority.

Technical Considerations

cudarc Context Caching

cudarc caches CudaDevice handles per ordinal. The code already documents a segfault risk if the primary context is reset and a stale cached handle is reused (expand.rs:252-257). Multi-GPU work needs to either:

  • Ensure each device's context is independently managed
  • Or patch the cudarc interaction to support multi-context safely
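One possible shape for the first option: a hypothetical per-ordinal registry that creates each candle Device exactly once and hands out clones, so no code path rebuilds a device behind a reset primary context and trips the stale-handle segfault documented in expand.rs:

```rust
use std::collections::HashMap;

use candle_core::{Device, Result};

/// Hypothetical registry: one Device per ordinal for the life of the process.
#[derive(Default)]
pub struct DeviceRegistry {
    devices: HashMap<usize, Device>,
}

impl DeviceRegistry {
    pub fn get_or_create(&mut self, ordinal: usize) -> Result<Device> {
        if let Some(dev) = self.devices.get(&ordinal) {
            return Ok(dev.clone());
        }
        let dev = Device::new_cuda(ordinal)?;
        self.devices.insert(ordinal, dev.clone());
        Ok(dev)
    }
}
```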

VRAM Budgeting

Current thresholds (T5: 16GB, CLIP-G: 2.8GB, Qwen2 FP16: 16GB, etc.) assume a single VRAM pool. Multi-GPU budgeting needs aggregate and per-device awareness to make smart placement decisions.
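A sketch of what per-device budgeting could look like; place_components and its greedy policy are hypothetical, and the real logic would need to honor the per-engine thresholds above rather than raw byte counts:

```rust
use std::cmp::Reverse;

/// Hypothetical budgeting pass: greedily place the largest components on the
/// GPU with the most free VRAM; anything that fits nowhere keeps today's
/// CPU-offload behaviour (None).
pub fn place_components(
    mut free_per_gpu: Vec<(usize, usize)>, // (ordinal, free bytes)
    components: &[(&str, usize)],          // (name, weight bytes), largest first
) -> Vec<(String, Option<usize>)> {
    let mut placements = Vec::new();
    for &(name, bytes) in components {
        // Always consider the device with the most headroom first.
        free_per_gpu.sort_by_key(|&(_, free)| Reverse(free));
        match free_per_gpu.first_mut() {
            Some((ordinal, free)) if *free >= bytes => {
                *free -= bytes;
                placements.push((name.to_string(), Some(*ordinal)));
            }
            _ => placements.push((name.to_string(), None)),
        }
    }
    placements
}
```

Under a policy like this, the encoders listed above (e.g. T5 at 16GB, CLIP-G at 2.8GB) would land on whichever card has the most headroom, leaving the transformer to claim the other card.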

Metal (macOS)

Metal uses unified memory — multi-GPU is not applicable on Apple Silicon. Multi-GPU paths should be CUDA-only, gated behind #[cfg(feature = "cuda")].

Environment Tested

  • Dual NVIDIA L40S (48GB VRAM each, 96GB total)
  • Models like Qwen-Image and FLUX-dev trigger CPU offloading despite 96GB of available GPU memory across both cards

Impact

  • Large models: Qwen-Image, FLUX-dev, LTX-Video would fit entirely in GPU memory across 2 cards instead of offloading to CPU
  • Inference speed: Eliminating CPU↔GPU block streaming (3-5x penalty) would bring multi-GPU inference close to eager-mode speeds
  • Server throughput: Could potentially run independent requests on separate GPUs (device-affinity scheduling)

Key Files

  • crates/mold-inference/src/device.rs — Device creation, VRAM queries, reclaim logic
  • crates/mold-inference/src/flux/offload.rs — FLUX block-level CPU↔GPU streaming
  • crates/mold-inference/src/qwen_image/offload.rs — Qwen-Image dynamic block placement
  • crates/mold-inference/src/factory.rs — Engine creation (passes device to all engines)
  • crates/mold-server/src/state.rs — ModelCache / AppState (server-side model management)
