intel/llm-scaler-vllm:1.4 — TP=2 fails with UR_RESULT_ERROR_DEVICE_LOST during determine_available_memory on dual Arc B50 Pro (regression vs 1.3) #424

@Larka16

Description

After upgrading from intel/llm-scaler-vllm:1.3 to intel/llm-scaler-vllm:1.4 (0.14.0-b8.2.1) with no other changes, a previously-working dual-GPU tensor-parallel (-tp 2) configuration on two Arc B50 Pro GPUs crashes at startup with RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST).

The crash occurs during the memory profiling step (determine_available_memory → profile_run → _sync_device → torch.xpu.synchronize()), immediately after model weights finish loading, on both TP workers simultaneously.

  • 1.3 with the identical command and compose: works.
  • 1.4 with -tp 2: fails (DEVICE_LOST).
  • 1.4 with -tp 1: works (reaches KV-cache allocation cleanly — see below).

This narrows the regression to the multi-GPU, inter-worker communication path (oneCCL / Level Zero IPC) as shipped in the 1.4 image, and away from the model, the single-GPU inference kernels, and the user configuration.

Environment

   
Image:    intel/llm-scaler-vllm:1.4 (0.14.0-b8.2.1)
GPUs:     2× Intel Arc B50 Pro (Battlemage / Xe2)
Host OS:  Kubuntu 26.04
Model:    Qwen3-14B, --quantization sym_int4, --dtype float16
TP:       --tensor-parallel-size 2
Runtime:  Docker Compose, ipc: host, shm_size: 32g, apparmor=unconfined, /dev/dri mapped

vLLM version reported in logs: 0.14.1.dev0+gb17039bcc.d20260430.

Steps to reproduce

  1. Run the dual-B50 TP=2 configuration below on the 1.4 image.
  2. Container loads weights successfully, then crashes during memory profiling.

Serving command:

vllm serve /llm/models/Qwen3-14B \
  --served-model-name Qwen3-14B \
  --host 0.0.0.0 --port 8000 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tensor-parallel-size 2 \
  --dtype float16 --quantization sym_int4 \
  --enforce-eager --trust-remote-code --disable-sliding-window \
  --gpu-memory-utilization 0.85 \
  --max-model-len 24576 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16
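
For A/B runs of TP=1 against TP=2, it can help to generate the command from one place so that only the parallelism (and, for TP=1, the context length) changes. The following is a minimal sketch, not part of the original setup: `build_serve_argv` is a hypothetical helper name, and every flag value is copied from the serving command above.

```python
import shlex


def build_serve_argv(tp_size: int, max_model_len: int = 24576) -> list[str]:
    """Return the `vllm serve` argv from this report.

    Only the tensor-parallel size and the max model length are parameters;
    all other flags are fixed to the values in the reported command.
    """
    return [
        "vllm", "serve", "/llm/models/Qwen3-14B",
        "--served-model-name", "Qwen3-14B",
        "--host", "0.0.0.0", "--port", "8000",
        "--reasoning-parser", "deepseek_r1",
        "--enable-auto-tool-choice", "--tool-call-parser", "hermes",
        "--tensor-parallel-size", str(tp_size),
        "--dtype", "float16", "--quantization", "sym_int4",
        "--enforce-eager", "--trust-remote-code", "--disable-sliding-window",
        "--gpu-memory-utilization", "0.85",
        "--max-model-len", str(max_model_len),
        "--max-num-batched-tokens", "16384",
        "--max-num-seqs", "16",
    ]


if __name__ == "__main__":
    # Print both variants as shell-quoted command lines for copy/paste.
    for tp in (1, 2):
        print(shlex.join(build_serve_argv(tp)))
```

Diffing the two printed lines should show `--tensor-parallel-size` as the only difference, which keeps the comparison between the working and failing runs honest.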

Crash log (TP=2)

Weights load fine, then both workers fail at the same point:

(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] WorkerProc hit an exception.
(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] Traceback (most recent call last):
...
  File ".../vllm/v1/worker/xpu_worker.py", line 134, in _determine_available_memory_default
    self.model_runner.profile_run()
  File ".../vllm/v1/worker/gpu_model_runner.py", line 4759, in profile_run
    self._sync_device()
  File ".../vllm/v1/worker/xpu_model_runner.py", line 35, in _sync_device
    torch.xpu.synchronize()
  File ".../torch/xpu/__init__.py", line 451, in synchronize
    return torch._C._xpu_synchronize(device)
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)

EngineCore then fails to start (determine_available_memory → collective_rpc → get_response).
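
To confirm from a saved container log that every TP worker hit the error, a small triage script along these lines may help. It is only a sketch: `summarize_crash` is a hypothetical helper, and the regular expression assumes the `(Worker_TPn pid=...) ERROR` prefix seen in the excerpt above.

```python
import re

# Matches the worker prefix seen in the log excerpt, e.g. "(Worker_TP1 pid=686) ERROR".
WORKER_ERROR_RE = re.compile(r"\((Worker_TP\d+) pid=(\d+)\) ERROR")
DEVICE_LOST = "UR_RESULT_ERROR_DEVICE_LOST"


def summarize_crash(log_text: str) -> dict:
    """Report which workers logged an ERROR and whether DEVICE_LOST appears."""
    workers = sorted({m.group(1) for m in WORKER_ERROR_RE.finditer(log_text)})
    return {"workers_with_errors": workers, "device_lost": DEVICE_LOST in log_text}


# Two lines taken from the crash log in this report.
SAMPLE = """\
(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] WorkerProc hit an exception.
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
"""

if __name__ == "__main__":
    print(summarize_crash(SAMPLE))
```

Run against the full log (for example, the output of `docker logs`), the `workers_with_errors` list should contain both TP workers if the failure is simultaneous as described.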

Relevant oneCCL warnings preceding the crash

The IPC exchange mode requested via CCL_ZE_IPC_EXCHANGE=sockets does not take effect — it falls back to drmfd:

|CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
|CCL_WARN| topology recognition shows PCIe connection between devices...
|CCL_WARN| pidfd is not supported, fallbacks to drmfd exchange mode

This drmfd fallback is suspected to be related to the failure (cross-process GPU memory-handle exchange between the two TP workers).
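
When comparing runs with different CCL_ZE_IPC_EXCHANGE values, the requested and fallback modes can be pulled out of the log mechanically. This is a rough sketch that assumes the warning wording shown above; `ipc_exchange_summary` is a hypothetical name, and the patterns would need adjusting if oneCCL phrases the messages differently in other versions.

```python
import re

# Patterns are based only on the three CCL_WARN lines quoted in this report.
REQUESTED_RE = re.compile(r"value of CCL_ZE_IPC_EXCHANGE changed to be (\w+)")
FALLBACK_RE = re.compile(r"(\w+) is not supported, fallbacks to (\w+) exchange mode")


def ipc_exchange_summary(log_text: str) -> dict:
    """Extract the requested IPC exchange mode and any reported fallback."""
    requested = REQUESTED_RE.search(log_text)
    fallback = FALLBACK_RE.search(log_text)
    return {
        "requested": requested.group(1) if requested else None,
        "unsupported": fallback.group(1) if fallback else None,
        "fallback": fallback.group(2) if fallback else None,
    }


SAMPLE = """\
|CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
|CCL_WARN| topology recognition shows PCIe connection between devices...
|CCL_WARN| pidfd is not supported, fallbacks to drmfd exchange mode
"""

if __name__ == "__main__":
    print(ipc_exchange_summary(SAMPLE))
    # {'requested': 'sockets', 'unsupported': 'pidfd', 'fallback': 'drmfd'}
```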

TP=1 works (isolation)

With --tensor-parallel-size 1 (single B50) on the same 1.4 image, there is no DEVICE_LOST. The engine completes the full profiling run and fails only at the expected KV-cache sizing check (a capacity issue, not a fault):

[Memory Profiling Analysis]
  > Peak Allocated (Real Need)  : 11.63 GB
  > Model memory usage          : 9.25 GB
  > Current Reserved (Footprint): 11.64 GB
  > Fragmentation (Wasted)      : 0.01 GB
ValueError: To serve at least one request with the model's max seq len (24576),
(3.75 GiB KV cache is needed, which is larger than the available KV cache memory (1.9 GiB).
Based on the available memory, the estimated maximum model length is 12480.

Reducing --max-model-len then lets TP=1 start and serve normally. So 1.4 runs fine on a single B50 — the failure is specific to the multi-GPU TP path.
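
The KV-cache figures in the TP=1 log can be cross-checked by hand, which supports reading that failure as a plain capacity limit. The sketch below assumes the published Qwen3-14B architecture (40 layers, 8 KV heads, head dimension 128) and a 2-byte float16 KV cache; none of these values appear in this report, so treat them as assumptions.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """KV-cache bytes for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Qwen3-14B values (not stated in the report): 40 layers, 8 KV heads,
# head_dim 128, float16 cache (2 bytes per element).
PER_TOKEN = kv_bytes_per_token(num_layers=40, num_kv_heads=8, head_dim=128, dtype_bytes=2)

needed_gib = 24576 * PER_TOKEN / GIB     # cache for one request at --max-model-len 24576
estimated_gib = 12480 * PER_TOKEN / GIB  # cache for the estimated maximum length in the log

if __name__ == "__main__":
    print(PER_TOKEN)               # 163840 bytes (160 KiB) per token
    print(f"{needed_gib:.2f}")     # 3.75
    print(f"{estimated_gib:.2f}")  # 1.90
```

Under these assumptions the numbers match the log: 24576 tokens need 3.75 GiB, and 12480 tokens fit in about 1.9 GiB.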

Workarounds attempted (none resolved the TP=2 failure)

All of the following were tried on the 1.4 image with -tp 2; every one produced the identical UR_RESULT_ERROR_DEVICE_LOST at the same location:

  • cap_add: SYS_PTRACE
  • pid: host (host PID namespace, in addition to ipc: host already set)
  • CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
  • CCL_ZE_IPC_EXCHANGE=sockets
  • CCL_ZE_IPC_EXCHANGE=drmfd (explicit)
  • Combinations of the above

Expected behaviour

TP=2 on dual Arc B50 Pro should initialize and serve as it does on 1.3.

Actual behaviour

TP=2 crashes during memory profiling with UR_RESULT_ERROR_DEVICE_LOST; TP=1 works on the same image.
