Summary
Mold is currently strictly single-GPU — every device selection is hardcoded to GPU index 0. On multi-GPU hosts (e.g. dual L40S), only one GPU is utilized regardless of available hardware. Large models like Qwen-Image (60 transformer blocks, ~38GB BF16) and FLUX-dev (~23GB BF16) fall back to CPU↔GPU block offloading even when a second GPU sits idle, resulting in 3-5x slower inference.
Current Architecture
Hardcoded Device 0
`crates/mold-inference/src/device.rs` — `create_device()`:

```rust
Ok(Device::new_cuda(0)?) // always GPU 0
```
All VRAM queries (`free_vram_bytes()`, `vram_used_estimate()`) also target only GPU 0's context. There is already a TODO comment in `reclaim_gpu_memory()` acknowledging this:

> "NOTE: This assumes a single-GPU setup — consistent with create_device which also hardcodes device 0. If multi-GPU support is added, this should iterate over all device indices that held engine allocations."
Single-GPU Assumptions Across Engines
| Engine | Text Encoder | Transformer | VAE |
|---|---|---|---|
| FLUX | Dynamic GPU/CPU | Offloaded blocks (CPU→GPU:0) | GPU:0 |
| Qwen-Image | Dynamic GPU/CPU + staging | Dynamic block placement (GPU:0 + CPU) | Dynamic GPU/CPU |
| Z-Image | VRAM-based variant | Drop-and-reload on GPU:0 | Dynamic |
| SD3 | Triple encoder on GPU:0/CPU | GPU:0 | GPU:0 |
| LTX-Video | T5 on GPU:0/CPU | GPU:0 | GPU:0 |
Every engine follows the same pattern: measure VRAM on GPU 0, decide what fits, offload the rest to CPU. A second GPU is never considered.
Proposed Multi-GPU Strategies
Phase 1: Device Selection & Enumeration
- Add `--gpu` / `MOLD_GPU` to select a specific GPU index (not everyone wants device 0)
- Add `mold info --gpus` to enumerate available GPUs with VRAM
- Per-device VRAM queries: `free_vram_on_device(ordinal)` instead of only the current context
- Config support: `gpu = 1` in `config.toml`
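The selection order implied by the list above (CLI flag over environment variable over config file over the current default) can be sketched as a small resolver. This is a hypothetical sketch: `resolve_gpu_ordinal` and its parameter names are illustrative, not existing mold APIs.

```rust
// Hypothetical sketch of Phase 1 GPU selection. Resolution priority:
// --gpu CLI flag > MOLD_GPU env var > `gpu = N` in config.toml > device 0.
fn resolve_gpu_ordinal(
    cli_gpu: Option<usize>,    // parsed from --gpu
    env_gpu: Option<&str>,     // from std::env::var("MOLD_GPU").ok()
    config_gpu: Option<usize>, // from `gpu = N` in config.toml
) -> Result<usize, String> {
    if let Some(idx) = cli_gpu {
        return Ok(idx); // explicit CLI flag wins
    }
    if let Some(raw) = env_gpu {
        return raw
            .parse::<usize>()
            .map_err(|_| format!("MOLD_GPU must be an integer, got {raw:?}"));
    }
    // Config value, else today's hardcoded behavior (GPU 0).
    Ok(config_gpu.unwrap_or(0))
}
```

The resolved ordinal would then be handed to a parameterized `create_device()` rather than the hardcoded `Device::new_cuda(0)`.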
Phase 2: Pipeline Parallelism (Component Placement)
Place different pipeline stages on different GPUs:
- GPU:0 — Text encoder(s)
- GPU:1 — Transformer + VAE
This is the lowest-hanging fruit and matches how the code already separates encoder vs. transformer lifecycle (drop-and-reload pattern). Implementation:
- Extend `create_device()` to accept a device index
- Add per-component device config: `--encoder-gpu 0 --transformer-gpu 1`
- Tensor transfers between devices via `tensor.to_device()`
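A minimal sketch of the per-component placement config, under the assumption that each component's ordinal is validated against the enumerated device count before any model loads. The `Placement` type and its methods are illustrative, not existing mold code.

```rust
// Hypothetical Phase 2 placement config: one GPU ordinal per pipeline stage.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Placement {
    encoder_gpu: usize,     // from --encoder-gpu
    transformer_gpu: usize, // from --transformer-gpu (VAE assumed co-located)
}

impl Placement {
    /// Reject ordinals beyond the number of visible GPUs before loading anything.
    fn validate(self, num_gpus: usize) -> Result<Self, String> {
        for (name, idx) in [("encoder", self.encoder_gpu), ("transformer", self.transformer_gpu)] {
            if idx >= num_gpus {
                return Err(format!("--{name}-gpu {idx} is out of range ({num_gpus} GPUs visible)"));
            }
        }
        Ok(self)
    }

    /// True when encoder outputs must hop devices (a tensor.to_device() transfer)
    /// before the transformer consumes them.
    fn needs_transfer(self) -> bool {
        self.encoder_gpu != self.transformer_gpu
    }
}
```

Since the encoder output (prompt embeddings) is small relative to the weights, the single `to_device()` hop per request should be negligible next to the saved offload traffic.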
Phase 3: Block Distribution (Model Parallelism)
Distribute transformer blocks across GPUs instead of offloading to CPU:
- Current (single GPU): `Blocks[0..N]` on GPU:0, `Blocks[N..60]` on CPU (streamed one at a time)
- Multi-GPU: `Blocks[0..30]` on GPU:0, `Blocks[30..60]` on GPU:1 (no CPU roundtrip)
This directly replaces the CPU offload path with GPU-to-GPU transfer, which should be dramatically faster (NVLink/PCIe GPU↔GPU vs. a system RAM roundtrip). The existing `BlockResidency` enum in Qwen-Image's offloader and the block-streaming architecture in FLUX's offloader are natural extension points.
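The block split need not be an even 30/30: if the devices have different headroom (e.g. one GPU also hosts the text encoder), ranges can be sized proportionally to free VRAM. A hypothetical partitioning helper, assuming a non-empty device list with nonzero free VRAM:

```rust
// Hypothetical Phase 3 sketch: split N transformer blocks across GPUs in
// proportion to each device's free VRAM, instead of spilling to CPU.
// Returns one contiguous [start, end) range per device; the last device
// absorbs any rounding remainder. Assumes free_vram is non-empty and nonzero.
fn partition_blocks(total_blocks: usize, free_vram: &[u64]) -> Vec<std::ops::Range<usize>> {
    let total_vram: u64 = free_vram.iter().sum();
    let mut ranges = Vec::with_capacity(free_vram.len());
    let mut start = 0usize;
    for (i, &vram) in free_vram.iter().enumerate() {
        let end = if i + 1 == free_vram.len() {
            total_blocks // last device takes the remainder
        } else {
            start + ((total_blocks as u64 * vram) / total_vram) as usize
        };
        ranges.push(start..end);
        start = end;
    }
    ranges
}
```

For the dual-L40S case from this issue (60 blocks, equal free VRAM), this yields the `[0..30]` / `[30..60]` split shown above.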
Phase 4: Tensor Parallelism (Weight Sharding)
Shard individual layer weights across GPUs (e.g., split attention heads). This is the most complex strategy and likely only worthwhile for very large models. Lower priority.
Technical Considerations
cudarc Context Caching
`cudarc` caches `CudaDevice` handles per ordinal. The code already documents a segfault risk if the primary context is reset and a stale cached handle is reused (`expand.rs:252-257`). Multi-GPU work needs to either:
- Ensure each device's context is independently managed
- Or patch the cudarc interaction to support multi-context safely
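One shape for the first option is a per-ordinal cache whose entries can be invalidated independently, so resetting one device's primary context never leaves a stale handle alive for another ordinal. This is an illustrative sketch of the caching pattern only (a `DeviceHandle` stand-in, not cudarc's real internals):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

// Stand-in for a real per-device CUDA context handle.
#[derive(Debug)]
struct DeviceHandle {
    ordinal: usize,
}

// Global handle cache keyed by device ordinal.
fn cache() -> &'static Mutex<HashMap<usize, Arc<DeviceHandle>>> {
    static CACHE: OnceLock<Mutex<HashMap<usize, Arc<DeviceHandle>>>> = OnceLock::new();
    CACHE.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Return the cached handle for `ordinal`, creating it on first use.
fn device(ordinal: usize) -> Arc<DeviceHandle> {
    cache()
        .lock()
        .unwrap()
        .entry(ordinal)
        .or_insert_with(|| Arc::new(DeviceHandle { ordinal }))
        .clone()
}

/// Drop only this ordinal's cached handle (e.g. after its context is reset),
/// leaving other devices' handles untouched.
fn invalidate(ordinal: usize) {
    cache().lock().unwrap().remove(&ordinal);
}
```

The key property is that `invalidate` is per-ordinal: a reset on GPU 0 cannot hand out a stale handle for GPU 1, which is the failure mode the `expand.rs` comment warns about for the single-device case.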
VRAM Budgeting
Current thresholds (T5: 16GB, CLIP-G: 2.8GB, Qwen2 FP16: 16GB, etc.) assume a single VRAM pool. Multi-GPU budgeting needs aggregate and per-device awareness to make smart placement decisions.
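One possible per-device budgeting policy is a greedy placement: sort components largest first, put each on the device with the most remaining free VRAM that still fits it, and fall back to CPU only when no device can hold it. A hypothetical sketch (the function, its signature, and the sizes in the test are illustrative, not mold's real budget tables):

```rust
// Hypothetical multi-GPU budgeting sketch: greedily place components
// (largest first) onto the device with the most remaining headroom.
// Returns (component name, Some(device index)) or (name, None) for CPU fallback.
fn place_components(
    components: &[(&str, u64)], // (name, size in bytes)
    mut free: Vec<u64>,         // per-device free VRAM in bytes
) -> Vec<(String, Option<usize>)> {
    let mut sorted: Vec<_> = components.to_vec();
    sorted.sort_by(|a, b| b.1.cmp(&a.1)); // largest component first
    sorted
        .into_iter()
        .map(|(name, bytes)| {
            // Among devices that fit this component, pick the most headroom.
            let best = (0..free.len())
                .filter(|&i| free[i] >= bytes)
                .max_by_key(|&i| free[i]);
            if let Some(i) = best {
                free[i] -= bytes;
            }
            (name.to_string(), best) // None = spill to CPU, as today
        })
        .collect()
}
```

With per-device accounting like this, a ~38GB transformer and a 16GB text encoder land on separate 48GB cards instead of triggering the CPU offload path.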
Metal (macOS)
Metal uses unified memory — multi-GPU is not applicable on Apple Silicon. Multi-GPU paths should be CUDA-only, gated behind #[cfg(feature = "cuda")].
Environment Tested
- Dual NVIDIA L40S (48GB VRAM each, 96GB total)
- Models like Qwen-Image and FLUX-dev trigger CPU offloading despite 96GB of available GPU memory across both cards
Impact
- Large models: Qwen-Image, FLUX-dev, LTX-Video would fit entirely in GPU memory across 2 cards instead of offloading to CPU
- Inference speed: Eliminating CPU↔GPU block streaming (3-5x penalty) would bring multi-GPU inference close to eager-mode speeds
- Server throughput: Could potentially run independent requests on separate GPUs (device-affinity scheduling)
Key Files
- `crates/mold-inference/src/device.rs` — Device creation, VRAM queries, reclaim logic
- `crates/mold-inference/src/flux/offload.rs` — FLUX block-level CPU↔GPU streaming
- `crates/mold-inference/src/qwen_image/offload.rs` — Qwen-Image dynamic block placement
- `crates/mold-inference/src/factory.rs` — Engine creation (passes device to all engines)
- `crates/mold-server/src/state.rs` — `ModelCache` / `AppState` (server-side model management)