Summary
Running `mold run ... --batch N` from a macOS client against a remote CUDA server (via `MOLD_HOST`) can produce a misleading `error: Metal out of memory` message even though the OOM originated on the server's CUDA GPU, not on the local Metal device. The accompanying suggestions also assume a local GPU context.
Separately, batch 1 of a multi-image run completed successfully, but batch 2 OOM'd mid-encode with no parameters changed, which suggests server-side VRAM growth between remote batch iterations worth investigating.
Reproduction
Client: macOS (aarch64-darwin).
Server: remote CUDA host reached via `MOLD_HOST`.
```
mold run qwen-image-edit-2511:q4 '<prompt>' \
  --image input.jpg --steps 20 --batch 5
```
Observed:
```
● Generating image 1/5 (seed: ...)
✓ Reloading Qwen2.5 encoder [4.3s]
✓ Encoding prompt (Qwen2.5 edit) [0.1s]
✓ Encoding negative prompt (Qwen2.5 edit) [0.1s]
✓ Encoding source image (VAE) [0.1s]
✓ Encoding edit images (VAE) [0.1s]
✓ Denoising edit (20 steps) [236.2s]
✓ Saved: mold-qwen-image-edit-2511-q4-...-0.png
● Generating image 2/5 (seed: ...)
✓ Reloading Qwen2.5 encoder [4.2s]
✓ Encoding prompt (Qwen2.5 edit) [0.1s]
✓ Encoding negative prompt (Qwen2.5 edit) [0.1s]
✓ Encoding source image (VAE) [0.0s]
✓ Encoding edit images (VAE) [0.0s]
error: Metal out of memory
GPU ran out of memory during generation.
Try these fixes:
Reduce resolution: --width 512 --height 512
Use a smaller model: mold run <model>:q4 "..."
...
```
Bug 1 — wrong backend label in the error message
`crates/mold-cli/src/main.rs:986-1020` decides between "Metal out of memory" and "CUDA out of memory" purely on the client platform:
```rust
let is_metal_oom = cfg!(target_os = "macos")
    && (msg.contains("CUDA_ERROR_OUT_OF_MEMORY")
        || msg.contains("Failed to create metal resource"));
let is_cuda_oom =
    cfg!(not(target_os = "macos")) && msg.contains("CUDA_ERROR_OUT_OF_MEMORY");
```
On a macOS client talking to a remote CUDA server, `cfg!(target_os = "macos")` is always true, so a CUDA OOM from the server is labelled as `Metal out of memory`. The surrounding hints (`--width 512 --height 512`, "source image resolution is used by default") are also framed around a local GPU, not a remote one.
Suggested fix
Route the OOM labelling through the actual source of the error:
- If the generation was dispatched via `generate_remote` (i.e. we hit an HTTP server), treat the OOM as server-side and either:
  - show a neutral `Remote server GPU out of memory (<host>)` line, or
  - use `ServerStatus.gpu_info.name` / a new `backend` field on `ServerStatus` to say "CUDA" vs "Metal".
- Only use the `cfg!(target_os = "macos")` branch for local inference (`--local` or local fallback).
- Adjust the hint block for remote OOMs (e.g. mention `MOLD_HOST` server VRAM, suggest a smaller model or reducing `--batch`, rather than `--width`/`--height` defaults that are already image-derived).
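A minimal sketch of this routing, assuming a hypothetical `ExecutionSite` value threaded through from the dispatch path (the enum, field names, and `oom_label` function below are illustrative, not mold's actual API):

```rust
// Hypothetical: where the generation actually ran. In mold this would be
// derived from whether generate_remote was used, not from cfg!(target_os).
#[allow(dead_code)]
enum ExecutionSite {
    Local, // --local or local fallback
    Remote { host: String, backend: Option<String> }, // backend from ServerStatus
}

/// Label an OOM based on where the work executed; the client-platform check
/// is only consulted for local inference.
fn oom_label(msg: &str, site: &ExecutionSite) -> Option<String> {
    let looks_like_oom = msg.contains("CUDA_ERROR_OUT_OF_MEMORY")
        || msg.contains("Failed to create metal resource");
    if !looks_like_oom {
        return None;
    }
    Some(match site {
        ExecutionSite::Remote { host, backend } => match backend.as_deref() {
            Some(b) => format!("{b} out of memory on remote server ({host})"),
            None => format!("Remote server GPU out of memory ({host})"),
        },
        ExecutionSite::Local if cfg!(target_os = "macos") => "Metal out of memory".to_string(),
        ExecutionSite::Local => "CUDA out of memory".to_string(),
    })
}
```

The key property is that a CUDA error string arriving from a remote host can never be labelled "Metal", regardless of the client platform.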
Bug 2 — OOM between remote batch iterations (investigation needed)
The remote batch path in `crates/mold-cli/src/commands/generate.rs:435` is a sequential loop of `generate_remote` calls with `batch_size = 1`. The server keeps the Qwen-Image-Edit engine resident between calls (LRU cache), and the logs show identical encoder/VAE phases for batches 1 and 2, yet batch 2 OOMs after batch 1 succeeds with the same shapes/steps.
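This failure shape (identical requests, first succeeds, second OOMs) is exactly what per-request retention produces. A toy model of it, with all types and sizes being stand-ins rather than mold's actual server code:

```rust
// Toy model: a server whose per-request cleanup misses one buffer, so
// identical requests succeed until cumulative "VRAM" crosses the limit.
struct ToyServer {
    vram_used: usize,         // GiB retained across requests
    vram_limit: usize,        // GiB total
    leaked_per_request: usize, // e.g. a parked VAE buffer or stale tensor
}

impl ToyServer {
    /// Simulate one generate call with a fixed working-set size (GiB).
    fn generate(&mut self, working_set: usize) -> Result<(), String> {
        if self.vram_used + working_set > self.vram_limit {
            return Err("CUDA_ERROR_OUT_OF_MEMORY".into());
        }
        // The working set itself is freed after the request, but one
        // buffer survives and compounds on the next identical call.
        self.vram_used += self.leaked_per_request;
        Ok(())
    }
}
```

With, say, a 10 GiB limit, an 8 GiB working set, and 3 GiB retained per request, batch 1 succeeds and batch 2 fails with identical parameters, matching the observed logs.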
Possible leak points to check on the server for `qwen-image-edit-2511:q4`:
- Whether the VAE device state persists as GPU-resident between requests (see `vae_on_gpu` paths around `crates/mold-inference/src/qwen_image/pipeline.rs:1375` and `:1948`), while the transformer is also reloaded.
- Whether `loaded.text_encoder.drop_weights()` at `pipeline.rs:2118` actually releases CUDA memory or leaves cached allocator fragments that compound across requests.
- Whether stale `packed_input_storage` / `img_shapes` tensors from the previous request are still reachable via any engine-level field.
- Whether the LRU cache's `ModelResidency::Parked` state still holds large GPU buffers for this family.
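Several of these candidates share one failure shape: dropping the primary handle frees nothing if a second clone is still reachable. Since weight buffers in this kind of code are typically `Arc`-backed, the pattern can be shown with plain `Arc`s (the `Engine` fields and `drop_weights` below are illustrative stand-ins, not mold's types):

```rust
use std::sync::Arc;

// Stand-in for a large GPU weight buffer.
struct Weights(#[allow(dead_code)] Vec<u8>);

struct Engine {
    text_encoder: Option<Arc<Weights>>,
    // A second, easy-to-miss reference, e.g. a parked-residency cache entry.
    parked: Option<Arc<Weights>>,
}

impl Engine {
    /// Analogue of a drop_weights() that clears only the obvious field.
    fn drop_weights(&mut self) {
        self.text_encoder = None;
        // Bug shape: self.parked still holds a clone, so the buffer survives
        // and the "freed" memory is never returned to the allocator.
    }
}
```

Checking `Arc::strong_count` (or the server's allocator stats) before and after each drop point is a quick way to confirm whether a buffer actually became unreachable between requests.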
As a workaround, a user can invoke `mold run` separately per batch, but that defeats the purpose of `--batch`.
Environment
- Client: macOS (aarch64-darwin)
- Server: remote CUDA host via `MOLD_HOST`
- Model: `qwen-image-edit-2511:q4` (GGUF)
- Command: `mold run ... --image ... --steps 20 --batch 5`
- Image was resized to 800x1312 (edit-mode auto-fit)
Acceptance
- Remote OOMs are labelled with the server's backend (or a neutral remote-server message), not the client platform's.
- No server-side VRAM growth between remote batch iterations for `qwen-image-edit-2511:q4` (repro with `--batch 2` at the same resolution/steps should either succeed or fail on batch 1 too).