Misleading "Metal out of memory" error when batching against a remote CUDA server from macOS #241

@jamesbrink

Description

Summary

Running mold run ... --batch N from a macOS client against a remote CUDA server (via MOLD_HOST) can produce a misleading "Metal out of memory" error even though the OOM originated on the server's CUDA GPU, not on the local Metal device. The accompanying fix suggestions also assume a local GPU context.

Separately, batch 1 of a multi-image run completed successfully, but batch 2 OOM'd (after the encode phases completed, before denoising started) without any parameter change, which suggests server-side VRAM growth between remote batch iterations that is worth investigating.

Reproduction

Client: macOS (aarch64-darwin).
Server: remote CUDA host reached via MOLD_HOST.

mold run qwen-image-edit-2511:q4 '<prompt>' \
    --image input.jpg --steps 20 --batch 5

Observed:

● Generating image 1/5 (seed: ...)
  ✓ Reloading Qwen2.5 encoder [4.3s]
  ✓ Encoding prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding negative prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding source image (VAE) [0.1s]
  ✓ Encoding edit images (VAE) [0.1s]
  ✓ Denoising edit (20 steps) [236.2s]
✓ Saved: mold-qwen-image-edit-2511-q4-...-0.png
● Generating image 2/5 (seed: ...)
  ✓ Reloading Qwen2.5 encoder [4.2s]
  ✓ Encoding prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding negative prompt (Qwen2.5 edit) [0.1s]
  ✓ Encoding source image (VAE) [0.0s]
  ✓ Encoding edit images (VAE) [0.0s]
error: Metal out of memory

  GPU ran out of memory during generation.
  Try these fixes:

    Reduce resolution:  --width 512 --height 512
    Use a smaller model: mold run <model>:q4 "..."
  ...

Bug 1 — wrong backend label in the error message

crates/mold-cli/src/main.rs:986-1020 decides between "Metal out of memory" and "CUDA out of memory" purely on the client platform:

let is_metal_oom = cfg!(target_os = "macos")
    && (msg.contains("CUDA_ERROR_OUT_OF_MEMORY")
        || msg.contains("Failed to create metal resource"));
let is_cuda_oom =
    cfg!(not(target_os = "macos")) && msg.contains("CUDA_ERROR_OUT_OF_MEMORY");

On a macOS client talking to a remote CUDA server, cfg!(target_os = "macos") is always true, so a CUDA OOM from the server is labelled as Metal out of memory. The surrounding hints (--width 512 --height 512, "source image resolution is used by default") are also framed around a local GPU, not a remote one.

Suggested fix

Route the OOM labelling through the actual source of the error:

  • If the generation was dispatched via generate_remote (i.e. we hit an HTTP server), treat the OOM as server-side and either:
    • Show a neutral Remote server GPU out of memory (<host>) line, or
    • Use ServerStatus.gpu_info.name / a new backend field on ServerStatus to say "CUDA" vs "Metal".
  • Only use the cfg!(target_os = "macos") branch for local inference (--local or local fallback).
  • Adjust the hint block for remote OOMs (e.g. mention MOLD_HOST server VRAM, suggest a smaller model or reducing --batch, rather than --width/--height defaults that are already image-derived).
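A minimal sketch of that routing, assuming a hypothetical `ErrorOrigin` value threaded through from the dispatch path (the enum, field, and function names here are illustrative, not mold's actual API):

```rust
// Sketch only: classify the OOM by where the error originated, not by the
// client OS. `ErrorOrigin` and `oom_label` are hypothetical names.
enum ErrorOrigin {
    Local,
    Remote { host: String },
}

fn oom_label(origin: &ErrorOrigin, msg: &str, client_is_macos: bool) -> String {
    let cuda = msg.contains("CUDA_ERROR_OUT_OF_MEMORY");
    let metal = msg.contains("Failed to create metal resource");
    match origin {
        // A remote OOM is always the server's GPU, regardless of client OS.
        ErrorOrigin::Remote { host } => {
            format!("Remote server GPU out of memory ({host})")
        }
        // Only local inference consults the client platform.
        ErrorOrigin::Local if client_is_macos && (cuda || metal) => {
            "Metal out of memory".to_string()
        }
        ErrorOrigin::Local if cuda => "CUDA out of memory".to_string(),
        ErrorOrigin::Local => "Out of memory".to_string(),
    }
}
```

The hint block could then branch on the same `ErrorOrigin`, so remote OOMs suggest `--batch`/model-variant changes while local OOMs keep the existing resolution hints.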

Bug 2 — OOM between remote batch iterations (investigation needed)

The remote batch path in crates/mold-cli/src/commands/generate.rs:435 is a sequential loop of generate_remote calls with batch_size = 1. The server keeps the Qwen-Image-Edit engine resident between calls (LRU cache), and the logs show identical encoder/VAE phases for batches 1 and 2 — yet batch 2 OOMs after batch 1 succeeds with the same shapes/steps.

Possible leak points to check on the server for qwen-image-edit-2511:q4:

  • Whether the VAE device state persists as GPU-resident between requests (see vae_on_gpu paths around crates/mold-inference/src/qwen_image/pipeline.rs:1375 and :1948), while the transformer is also reloaded.
  • Whether loaded.text_encoder.drop_weights() at pipeline.rs:2118 actually releases CUDA memory or leaves cached allocator fragments that compound across requests.
  • Whether stale packed_input_storage / img_shapes tensors from the previous request are still reachable via any engine-level field.
  • Whether the LRU cache's ModelResidency::Parked state still holds large GPU buffers for this family.

As a workaround, a user can invoke mold run once per image, but that defeats the purpose of --batch.

Environment

  • Client: macOS (aarch64-darwin)
  • Server: remote CUDA host via MOLD_HOST
  • Model: qwen-image-edit-2511:q4 (GGUF)
  • Command: mold run ... --image ... --steps 20 --batch 5
  • Image was resized to 800x1312 (edit-mode auto-fit)

Acceptance

  • On macOS client + remote CUDA server, OOM message identifies the remote/CUDA context, not "Metal".
  • Remote-mode OOM hints reference remote server tuning (batch size, model variant) rather than local-only knobs.
  • Root-cause the batch-2 OOM in qwen-image-edit-2511:q4 (repro with --batch 2 at the same resolution/steps should either succeed or fail on batch 1 too).
  • Regression tests covering (a) error formatting for remote OOM paths and (b) batch-2 memory parity for Qwen-Image-Edit.
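For point (b), the memory-parity check could be expressed as a small helper over per-batch free-VRAM samples. This is a sketch only: how the server reports free VRAM is left open, and `vram_parity` is a hypothetical name.

```rust
// Hypothetical regression helper: given the server's reported free VRAM (MB)
// sampled after each batch iteration, flag any iteration whose free memory
// drops more than `tolerance_mb` below the post-batch-1 baseline.
fn vram_parity(samples_mb: &[u64], tolerance_mb: u64) -> Result<(), String> {
    let baseline = samples_mb.first().copied().ok_or("no samples")?;
    for (i, &free) in samples_mb.iter().enumerate().skip(1) {
        let leaked = baseline.saturating_sub(free);
        if leaked > tolerance_mb {
            return Err(format!("batch {} leaked ~{leaked} MB vs baseline", i + 1));
        }
    }
    Ok(())
}
```

A test along these lines would fail on the current behavior (batch 2 holding substantially more VRAM than batch 1 at identical shapes/steps) and pass once the leak is fixed.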
