Summary
Mold is currently strictly single-GPU — every device selection is hardcoded to GPU index 0. On multi-GPU hosts (e.g. dual L40S), only one GPU is utilized regardless of available hardware. Large models like Qwen-Image (60 transformer blocks, ~38GB BF16) and FLUX-dev (~23GB BF16) fall back to CPU↔GPU block offloading even when a second GPU sits idle, resulting in 3-5x slower inference.
Current Architecture
Hardcoded Device 0
`crates/mold-inference/src/device.rs` — `create_device()`:

```rust
Ok(Device::new_cuda(0)?) // always GPU 0
```
All VRAM queries (`free_vram_bytes()`, `vram_used_estimate()`) also target only GPU 0's context. There is already a TODO comment in `reclaim_gpu_memory()` acknowledging this:

> "NOTE: This assumes a single-GPU setup — consistent with create_device which also hardcodes device 0. If multi-GPU support is added, this should iterate over all device indices that held engine allocations."
Single-GPU Assumptions Across Engines
| Engine | Text Encoder | Transformer | VAE |
|---|---|---|---|
| FLUX | Dynamic GPU/CPU | Offloaded blocks (CPU→GPU:0) | GPU:0 |
| Qwen-Image | Dynamic GPU/CPU + staging | Dynamic block placement (GPU:0 + CPU) | Dynamic GPU/CPU |
| Z-Image | VRAM-based variant | Drop-and-reload on GPU:0 | Dynamic |
| SD3 | Triple encoder on GPU:0/CPU | GPU:0 | GPU:0 |
| LTX-Video | T5 on GPU:0/CPU | GPU:0 | GPU:0 |
Every engine follows the same pattern: measure VRAM on GPU 0, decide what fits, offload the rest to CPU. A second GPU is never considered.
Proposed Multi-GPU Strategies
Phase 1: Device Selection & Enumeration
- Add `--gpu` / `MOLD_GPU` to select a specific GPU index (not everyone wants device 0)
- Add `mold info --gpus` to enumerate available GPUs with VRAM
- Per-device VRAM queries: `free_vram_on_device(ordinal)` instead of only the current context
- Config support: `gpu = 1` in `config.toml`
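The selection order implied by the list above (CLI flag over environment variable over config file over the current default) can be sketched as a small resolver. This is a hypothetical sketch: `resolve_gpu_ordinal` and its parameter names are illustrative, not existing mold APIs.

```rust
// Hypothetical sketch of Phase 1 GPU selection. Resolution priority:
// --gpu CLI flag > MOLD_GPU env var > `gpu = N` in config.toml > device 0.
fn resolve_gpu_ordinal(
    cli_gpu: Option<usize>,    // parsed from --gpu
    env_gpu: Option<&str>,     // from std::env::var("MOLD_GPU").ok()
    config_gpu: Option<usize>, // from `gpu = N` in config.toml
) -> Result<usize, String> {
    if let Some(idx) = cli_gpu {
        return Ok(idx); // explicit CLI flag wins
    }
    if let Some(raw) = env_gpu {
        return raw
            .parse::<usize>()
            .map_err(|_| format!("MOLD_GPU must be an integer, got {raw:?}"));
    }
    // Config value, else today's hardcoded behavior (GPU 0).
    Ok(config_gpu.unwrap_or(0))
}
```

The resolved ordinal would then be handed to a parameterized `create_device()` rather than the hardcoded `Device::new_cuda(0)`.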
Phase 2: Pipeline Parallelism (Component Placement)
Place different pipeline stages on different GPUs:
- GPU:0 — Text encoder(s)
- GPU:1 — Transformer + VAE
This is the lowest-hanging fruit and matches how the code already separates encoder vs. transformer lifecycle (drop-and-reload pattern). Implementation:
- Extend `create_device()` to accept a device index
- Add per-component device config: `--encoder-gpu 0 --transformer-gpu 1`
- Tensor transfers between devices via `tensor.to_device()`
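A minimal sketch of the per-component placement config, under the assumption that each component's ordinal is validated against the enumerated device count before any model loads. The `Placement` type and its methods are illustrative, not existing mold code.

```rust
// Hypothetical Phase 2 placement config: one GPU ordinal per pipeline stage.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Placement {
    encoder_gpu: usize,     // from --encoder-gpu
    transformer_gpu: usize, // from --transformer-gpu (VAE assumed co-located)
}

impl Placement {
    /// Reject ordinals beyond the number of visible GPUs before loading anything.
    fn validate(self, num_gpus: usize) -> Result<Self, String> {
        for (name, idx) in [("encoder", self.encoder_gpu), ("transformer", self.transformer_gpu)] {
            if idx >= num_gpus {
                return Err(format!("--{name}-gpu {idx} is out of range ({num_gpus} GPUs visible)"));
            }
        }
        Ok(self)
    }

    /// True when encoder outputs must hop devices (a tensor.to_device() transfer)
    /// before the transformer consumes them.
    fn needs_transfer(self) -> bool {
        self.encoder_gpu != self.transformer_gpu
    }
}
```

Since the encoder output (prompt embeddings) is small relative to the weights, the single `to_device()` hop per request should be negligible next to the saved offload traffic.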
Phase 3: Block Distribution (Model Parallelism)
Distribute transformer blocks across GPUs instead of offloading to CPU:
- Current (single GPU): `Blocks[0..N]` on GPU:0, `Blocks[N..60]` on CPU (streamed one at a time)
- Multi-GPU: `Blocks[0..30]` on GPU:0, `Blocks[30..60]` on GPU:1 (no CPU roundtrip)
This directly replaces the CPU offload path with GPU-to-GPU transfer, which should be dramatically faster (NVLink/PCIe GPU↔GPU vs. a system RAM roundtrip). The existing `BlockResidency` enum in Qwen-Image's offloader and the block-streaming architecture in FLUX's offloader are natural extension points.
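The block split need not be an even 30/30: if the devices have different headroom (e.g. one GPU also hosts the text encoder), ranges can be sized proportionally to free VRAM. A hypothetical partitioning helper, assuming a non-empty device list with nonzero free VRAM:

```rust
// Hypothetical Phase 3 sketch: split N transformer blocks across GPUs in
// proportion to each device's free VRAM, instead of spilling to CPU.
// Returns one contiguous [start, end) range per device; the last device
// absorbs any rounding remainder. Assumes free_vram is non-empty and nonzero.
fn partition_blocks(total_blocks: usize, free_vram: &[u64]) -> Vec<std::ops::Range<usize>> {
    let total_vram: u64 = free_vram.iter().sum();
    let mut ranges = Vec::with_capacity(free_vram.len());
    let mut start = 0usize;
    for (i, &vram) in free_vram.iter().enumerate() {
        let end = if i + 1 == free_vram.len() {
            total_blocks // last device takes the remainder
        } else {
            start + ((total_blocks as u64 * vram) / total_vram) as usize
        };
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

For the dual-L40S case from this issue (60 blocks, equal free VRAM), this yields the `[0..30]` / `[30..60]` split shown above.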
Phase 4: Tensor Parallelism (Weight Sharding)
Shard individual layer weights across GPUs (e.g., split attention heads). This is the most complex strategy and likely only worthwhile for very large models. Lower priority.
Technical Considerations
cudarc Context Caching
`cudarc` caches `CudaDevice` handles per ordinal. The code already documents a segfault risk if the primary context is reset and a stale cached handle is reused (`expand.rs:252-257`). Multi-GPU work needs to either:
- Ensure each device's context is independently managed
- Or patch the cudarc interaction to support multi-context safely
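One shape for the first option is a per-ordinal cache whose entries can be invalidated independently, so resetting one device's primary context never leaves a stale handle alive for another ordinal. This is an illustrative sketch of the caching pattern only (a `DeviceHandle` stand-in, not cudarc's real internals):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for a real per-device CUDA context handle.
#[derive(Debug)]
struct DeviceHandle {
    ordinal: usize,
}

// Global handle cache keyed by device ordinal.
fn cache() -> &'static Mutex<HashMap<usize, Arc<DeviceHandle>>> {
    static CACHE: OnceLock<Mutex<HashMap<usize, Arc<DeviceHandle>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return the cached handle for `ordinal`, creating it on first use.
fn device(ordinal: usize) -> Arc<DeviceHandle> {
    cache()
        .lock()
        .unwrap()
        .entry(ordinal)
        .or_insert_with(|| Arc::new(DeviceHandle { ordinal }))
        .clone()
}

/// Drop only this ordinal's cached handle (e.g. after its context is reset),
/// leaving other devices' handles untouched.
fn invalidate(ordinal: usize) {
    cache().lock().unwrap().remove(&ordinal);
}
```

The key property is that `invalidate` is per-ordinal: a reset on GPU 0 cannot hand out a stale handle for GPU 1, which is the failure mode the `expand.rs` comment warns about for the single-device case.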
VRAM Budgeting
Current thresholds (T5: 16GB, CLIP-G: 2.8GB, Qwen2 FP16: 16GB, etc.) assume a single VRAM pool. Multi-GPU budgeting needs aggregate and per-device awareness to make smart placement decisions.
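One possible per-device budgeting policy is a greedy placement: sort components largest first, put each on the device with the most remaining free VRAM that still fits it, and fall back to CPU only when no device can hold it. A hypothetical sketch (the function, its signature, and the sizes in the test are illustrative, not mold's real budget tables):

```rust
// Hypothetical multi-GPU budgeting sketch: greedily place components
// (largest first) onto the device with the most remaining headroom.
// Returns (component name, Some(device index)) or (name, None) for CPU fallback.
fn place_components(
    components: &[(&str, u64)], // (name, size in bytes)
    mut free: Vec<u64>,         // per-device free VRAM in bytes
) -> Vec<(String, Option<usize>)> {
    let mut sorted: Vec<_> = components.to_vec();
    sorted.sort_by(|a, b| b.1.cmp(&a.1)); // largest component first
    sorted
        .into_iter()
        .map(|(name, bytes)| {
            // Among devices that fit this component, pick the most headroom.
            let best = (0..free.len())
                .filter(|&i| free[i] >= bytes)
                .max_by_key(|&i| free[i]);
            if let Some(i) = best {
                free[i] -= bytes;
            }
            (name.to_string(), best) // None = spill to CPU, as today
        })
        .collect()
}
```

With per-device accounting like this, a ~38GB transformer and a 16GB text encoder land on separate 48GB cards instead of triggering the CPU offload path.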
Metal (macOS)
Metal uses unified memory — multi-GPU is not applicable on Apple Silicon. Multi-GPU paths should be CUDA-only, gated behind #[cfg(feature = "cuda")].
Environment Tested
- Dual NVIDIA L40S (48GB VRAM each, 96GB total)
- Models like Qwen-Image and FLUX-dev trigger CPU offloading despite 96GB of available GPU memory across both cards
Impact
- Large models: Qwen-Image, FLUX-dev, LTX-Video would fit entirely in GPU memory across 2 cards instead of offloading to CPU
- Inference speed: Eliminating CPU↔GPU block streaming (3-5x penalty) would bring multi-GPU inference close to eager-mode speeds
- Server throughput: Could potentially run independent requests on separate GPUs (device-affinity scheduling)
Key Files
- `crates/mold-inference/src/device.rs` — Device creation, VRAM queries, reclaim logic
- `crates/mold-inference/src/flux/offload.rs` — FLUX block-level CPU↔GPU streaming
- `crates/mold-inference/src/qwen_image/offload.rs` — Qwen-Image dynamic block placement
- `crates/mold-inference/src/factory.rs` — Engine creation (passes device to all engines)
- `crates/mold-server/src/state.rs` — `ModelCache` / `AppState` (server-side model management)