Misleading "Metal out of memory" error when batching against a remote CUDA server from macOS #241

@jamesbrink

Description

Summary

Running mold run ... --batch N from a macOS client against a remote CUDA server (via MOLD_HOST) can produce a misleading "Metal out of memory" error even though the OOM originated on the server's CUDA GPU, not on the local Metal device. The accompanying fix suggestions also assume a local GPU context.

Separately, batch 1 of a multi-image run completed successfully, but batch 2 OOM'd (after the encode phases completed, before denoising started) without any parameter change, which suggests server-side VRAM growth between remote batch iterations that is worth investigating.

Reproduction

Client: macOS (aarch64-darwin).
Server: remote CUDA host reached via MOLD_HOST.

mold run qwen-image-edit-2511:q4 '<prompt>' \
    --image input.jpg --steps 20 --batch 5

Observed:

● Generating image 1/5 (seed: ...)
  ✓ Reloading Qwen2.5 encoder [4.3s]
  ✓ Encoding prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding negative prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding source image (VAE) [0.1s]
  ✓ Encoding edit images (VAE) [0.1s]
  ✓ Denoising edit (20 steps) [236.2s]
✓ Saved: mold-qwen-image-edit-2511-q4-...-0.png
● Generating image 2/5 (seed: ...)
  ✓ Reloading Qwen2.5 encoder [4.2s]
  ✓ Encoding prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding negative prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding source image (VAE) [0.0s]
  ✓ Encoding edit images (VAE) [0.0s]
error: Metal out of memory

  GPU ran out of memory during generation.
  Try these fixes:

    Reduce resolution:  --width 512 --height 512
    Use a smaller model: mold run <model>:q4 "..."
  ...

Bug 1 — wrong backend label in the error message

crates/mold-cli/src/main.rs:986-1020 decides between "Metal out of memory" and "CUDA out of memory" purely on the client platform:

let is_metal_oom = cfg!(target_os = "macos")
    && (msg.contains("CUDA_ERROR_OUT_OF_MEMORY")
        || msg.contains("Failed to create metal resource"));
let is_cuda_oom =
    cfg!(not(target_os = "macos")) && msg.contains("CUDA_ERROR_OUT_OF_MEMORY");

On a macOS client talking to a remote CUDA server, cfg!(target_os = "macos") is always true, so a CUDA OOM from the server is labelled as Metal out of memory. The surrounding hints (--width 512 --height 512, "source image resolution is used by default") are also framed around a local GPU, not a remote one.

Suggested fix

Route the OOM labelling through the actual source of the error:

  • If the generation was dispatched via generate_remote (i.e. we hit an HTTP server), treat the OOM as server-side and either:
    • Show a neutral Remote server GPU out of memory (<host>) line, or
    • Use ServerStatus.gpu_info.name / a new backend field on ServerStatus to say "CUDA" vs "Metal".
  • Only use the cfg!(target_os = "macos") branch for local inference (--local or local fallback).
  • Adjust the hint block for remote OOMs (e.g. mention MOLD_HOST server VRAM, suggest a smaller model or reducing --batch, rather than --width/--height defaults that are already image-derived).
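A minimal sketch of that routing, assuming a hypothetical `ErrorOrigin` value threaded through from the dispatch path (the enum, field, and function names here are illustrative, not mold's actual API):

```rust
// Sketch only: classify the OOM by where the error originated, not by the
// client OS. `ErrorOrigin` and `oom_label` are hypothetical names.
enum ErrorOrigin {
    Local,
    Remote { host: String },
}

fn oom_label(origin: &ErrorOrigin, msg: &str, client_is_macos: bool) -> String {
    let cuda = msg.contains("CUDA_ERROR_OUT_OF_MEMORY");
    let metal = msg.contains("Failed to create metal resource");
    match origin {
        // A remote OOM is always the server's GPU, regardless of client OS.
        ErrorOrigin::Remote { host } => {
            format!("Remote server GPU out of memory ({host})")
        }
        // Only local inference consults the client platform.
        ErrorOrigin::Local if client_is_macos && (cuda || metal) => {
            "Metal out of memory".to_string()
        }
        ErrorOrigin::Local if cuda => "CUDA out of memory".to_string(),
        ErrorOrigin::Local => "Out of memory".to_string(),
    }
}
```

The hint block could then branch on the same `ErrorOrigin`, so remote OOMs suggest `--batch`/model-variant changes while local OOMs keep the existing resolution hints.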

Bug 2 — OOM between remote batch iterations (investigation needed)

The remote batch path in crates/mold-cli/src/commands/generate.rs:435 is a sequential loop of generate_remote calls with batch_size = 1. The server keeps the Qwen-Image-Edit engine resident between calls (LRU cache), and the logs show identical encoder/VAE phases for batches 1 and 2 — yet batch 2 OOMs after batch 1 succeeds with the same shapes/steps.

Possible leak points to check on the server for qwen-image-edit-2511:q4:

  • Whether the VAE device state persists as GPU-resident between requests (see vae_on_gpu paths around crates/mold-inference/src/qwen_image/pipeline.rs:1375 and :1948), while the transformer is also reloaded.
  • Whether loaded.text_encoder.drop_weights() at pipeline.rs:2118 actually releases CUDA memory or leaves cached allocator fragments that compound across requests.
  • Whether stale packed_input_storage / img_shapes tensors from the previous request are still reachable via any engine-level field.
  • Whether the LRU cache's ModelResidency::Parked state still holds large GPU buffers for this family.

As a workaround, a user can invoke mold run once per image, but that defeats the purpose of --batch.

Environment

  • Client: macOS (aarch64-darwin)
  • Server: remote CUDA host via MOLD_HOST
  • Model: qwen-image-edit-2511:q4 (GGUF)
  • Command: mold run ... --image ... --steps 20 --batch 5
  • Image was resized to 800x1312 (edit-mode auto-fit)

Acceptance

  • On macOS client + remote CUDA server, OOM message identifies the remote/CUDA context, not "Metal".
  • Remote-mode OOM hints reference remote server tuning (batch size, model variant) rather than local-only knobs.
  • Root-cause the batch-2 OOM in qwen-image-edit-2511:q4 (repro with --batch 2 at the same resolution/steps should either succeed or fail on batch 1 too).
  • Regression tests covering (a) error formatting for remote OOM paths and (b) batch-2 memory parity for Qwen-Image-Edit.
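For point (b), the memory-parity check could be expressed as a small helper over per-batch free-VRAM samples. This is a sketch only: how the server reports free VRAM is left open, and `vram_parity` is a hypothetical name.

```rust
// Hypothetical regression helper: given the server's reported free VRAM (MB)
// sampled after each batch iteration, flag any iteration whose free memory
// drops more than `tolerance_mb` below the post-batch-1 baseline.
fn vram_parity(samples_mb: &[u64], tolerance_mb: u64) -> Result<(), String> {
    let baseline = samples_mb.first().copied().ok_or("no samples")?;
    for (i, &free) in samples_mb.iter().enumerate().skip(1) {
        let leaked = baseline.saturating_sub(free);
        if leaked > tolerance_mb {
            return Err(format!("batch {} leaked ~{leaked} MB vs baseline", i + 1));
        }
    }
    Ok(())
}
```

A test along these lines would fail on the current behavior (batch 2 holding substantially more VRAM than batch 1 at identical shapes/steps) and pass once the leak is fixed.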
