After upgrading from intel/llm-scaler-vllm:1.3 to intel/llm-scaler-vllm:1.4 (0.14.0-b8.2.1) with no other changes, a previously-working dual-GPU tensor-parallel (-tp 2) configuration on two Arc B50 Pro GPUs crashes at startup with RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST).
The crash occurs during the memory profiling step (determine_available_memory → profile_run → _sync_device → torch.xpu.synchronize()), immediately after model weights finish loading, on both TP workers simultaneously.
- 1.3 with the identical command and compose file: works.
- 1.4 with -tp 2: fails (DEVICE_LOST).
- 1.4 with -tp 1: works (no DEVICE_LOST; startup reaches KV-cache sizing cleanly, see below).
This points to a regression in the multi-GPU / inter-worker communication path (oneCCL / Level Zero IPC) shipped in the 1.4 image, rather than in the model, the inference kernels, or the user configuration.
Environment
| Component | Value |
|---|---|
| Image | intel/llm-scaler-vllm:1.4 (0.14.0-b8.2.1) |
| GPUs | 2× Intel Arc B50 Pro (Battlemage / Xe2) |
| Host OS | Kubuntu 26.04 |
| Model | Qwen3-14B, --quantization sym_int4, --dtype float16 |
| TP | --tensor-parallel-size 2 |
| Runtime | Docker Compose, ipc: host, shm_size: 32g, apparmor=unconfined, /dev/dri mapped |

vLLM version reported in logs: 0.14.1.dev0+gb17039bcc.d20260430.
Steps to reproduce
- Run the dual-B50 TP=2 configuration below on the 1.4 image.
- Container loads weights successfully, then crashes during memory profiling.
Serving command:
vllm serve /llm/models/Qwen3-14B \
--served-model-name Qwen3-14B \
--host 0.0.0.0 --port 8000 \
--reasoning-parser deepseek_r1 \
--enable-auto-tool-choice --tool-call-parser hermes \
--tensor-parallel-size 2 \
--dtype float16 --quantization sym_int4 \
--enforce-eager --trust-remote-code --disable-sliding-window \
--gpu-memory-utilization 0.85 \
--max-model-len 24576 \
--max-num-batched-tokens 16384 \
--max-num-seqs 16
Crash log (TP=2)
Weights load fine, then both workers fail at the same point:
(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] WorkerProc hit an exception.
(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] Traceback (most recent call last):
...
File ".../vllm/v1/worker/xpu_worker.py", line 134, in _determine_available_memory_default
self.model_runner.profile_run()
File ".../vllm/v1/worker/gpu_model_runner.py", line 4759, in profile_run
self._sync_device()
File ".../vllm/v1/worker/xpu_model_runner.py", line 35, in _sync_device
torch.xpu.synchronize()
File ".../torch/xpu/__init__.py", line 451, in synchronize
return torch._C._xpu_synchronize(device)
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
EngineCore then fails to start (determine_available_memory → collective_rpc → get_response).
Relevant oneCCL warnings preceding the crash
The IPC exchange mode requested via CCL_ZE_IPC_EXCHANGE=sockets does not appear to take effect; oneCCL reports falling back to drmfd:
|CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
|CCL_WARN| topology recognition shows PCIe connection between devices...
|CCL_WARN| pidfd is not supported, fallbacks to drmfd exchange mode
This drmfd fallback is suspected to be related to the failure (cross-process GPU memory-handle exchange between the two TP workers).
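To make the mismatch between the requested and the reported exchange mode easier to spot in a long startup log, the warnings above can be scanned mechanically. The following is a minimal sketch, not part of oneCCL or vLLM: the function name `summarize_ipc_exchange` and the two regular expressions are assumptions written against the exact wording of the three warning lines quoted above, and may need adjusting for other oneCCL versions.

```python
import re

# Sample lines copied verbatim from the warnings quoted in this report.
SAMPLE_LOG = """\
|CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
|CCL_WARN| topology recognition shows PCIe connection between devices...
|CCL_WARN| pidfd is not supported, fallbacks to drmfd exchange mode
"""

# Hypothetical patterns matched against the wording seen in this log only.
REQUESTED_RE = re.compile(r"value of CCL_ZE_IPC_EXCHANGE changed to be (\w+)")
FALLBACK_RE = re.compile(r"(\w+) is not supported, fallbacks to (\w+) exchange mode")


def summarize_ipc_exchange(log_text):
    """Collect the requested IPC exchange mode and any reported fallback."""
    requested = None
    fallback_from = None
    fallback_to = None
    for line in log_text.splitlines():
        if "|CCL_WARN|" not in line:
            continue
        match = REQUESTED_RE.search(line)
        if match:
            requested = match.group(1)
            continue
        match = FALLBACK_RE.search(line)
        if match:
            fallback_from, fallback_to = match.group(1), match.group(2)
    return {
        "requested": requested,
        "fallback_from": fallback_from,
        "fallback_to": fallback_to,
        # True when a fallback was logged to a mode other than the requested one.
        "mismatch": fallback_to is not None and fallback_to != requested,
    }


summary = summarize_ipc_exchange(SAMPLE_LOG)
print(summary)
```

On the sample lines this reports a request for sockets and a fallback from pidfd to drmfd, which is the discrepancy described above. It only summarizes what the log says; it does not show which mode oneCCL actually used at runtime.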
TP=1 works (isolation)
With --tensor-parallel-size 1 (single B50) on the same 1.4 image, there is no DEVICE_LOST. The engine completes the full profiling run and fails only at the expected KV-cache sizing check (a capacity issue, not a fault):
[Memory Profiling Analysis]
> Peak Allocated (Real Need) : 11.63 GB
> Model memory usage : 9.25 GB
> Current Reserved (Footprint): 11.64 GB
> Fragmentation (Wasted) : 0.01 GB
ValueError: To serve at least one request with the model's max seq len (24576),
(3.75 GiB KV cache is needed, which is larger than the available KV cache memory (1.9 GiB).
Based on the available memory, the estimated maximum model length is 12480.
Reducing --max-model-len then lets TP=1 start and serve normally. So 1.4 runs fine on a single B50; the failure is specific to the multi-GPU TP path.
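The KV-cache figures in the TP=1 error can be sanity-checked with simple arithmetic, which supports reading it as a plain capacity limit rather than a fault. The sketch below assumes the attention shape from the published Qwen3-14B model config (40 layers, 8 KV heads, head dimension 128) and a float16 KV cache; none of these values are stated in this report, and the calculation only approximates vLLM's own block-based accounting.

```python
GIB = 1024 ** 3

# Assumed Qwen3-14B attention shape (from the published model config,
# not from this report).
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
DTYPE_BYTES = 2  # float16 KV cache, assumed to follow --dtype float16


def kv_bytes_per_token():
    """Bytes of KV cache one token occupies across all layers."""
    # The factor 2 covers the separate key and value tensors in each layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES


def kv_gib_for(tokens):
    """KV cache size in GiB for a single sequence of the given length."""
    return tokens * kv_bytes_per_token() / GIB


def max_tokens_for(available_gib):
    """Longest single sequence that fits in the given KV cache budget."""
    return int(available_gib * GIB // kv_bytes_per_token())


print(kv_gib_for(24576))           # requirement for --max-model-len 24576
print(round(kv_gib_for(12480), 2)) # size implied by the logged estimate
print(max_tokens_for(1.9))         # estimate from the rounded 1.9 GiB figure
```

Under these assumptions a 24576-token sequence needs 3.75 GiB, matching the error message, and the logged estimate of 12480 tokens corresponds to about 1.90 GiB. Working backwards from the rounded 1.9 GiB gives a slightly lower token count than 12480, which is expected because the log prints the available memory with limited precision.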
Workarounds attempted (none resolved the TP=2 failure)
All of the following were tried on the 1.4 image with -tp 2; every one produced the identical UR_RESULT_ERROR_DEVICE_LOST at the same location:
- cap_add: SYS_PTRACE
- pid: host (host PID namespace, in addition to ipc: host already set)
- CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0
- CCL_ZE_IPC_EXCHANGE=sockets
- CCL_ZE_IPC_EXCHANGE=drmfd (explicit)
- Combinations of the above
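For anyone retesting, the workaround space above can be enumerated systematically so that no combination is skipped. This is a minimal sketch: the `AXES` table and `workaround_matrix` are hypothetical helpers that treat each listed workaround as an independent toggle, and the result is not necessarily the exact set of combinations that was tried for this report. It only generates the settings; applying them to the compose file is left to the reader.

```python
from itertools import product

# Each workaround from the list above as one axis; None means "not applied".
# The two CCL_ZE_IPC_EXCHANGE values share an axis because they are mutually
# exclusive settings of the same variable.
AXES = {
    "cap_add": [None, "SYS_PTRACE"],
    "pid": [None, "host"],
    "CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK": [None, "0"],
    "CCL_ZE_IPC_EXCHANGE": [None, "sockets", "drmfd"],
}


def workaround_matrix():
    """Yield every non-empty combination of the workarounds as a dict."""
    names = list(AXES)
    for values in product(*(AXES[name] for name in names)):
        combo = {name: value for name, value in zip(names, values) if value is not None}
        if combo:  # skip the baseline where nothing is applied
            yield combo


combos = list(workaround_matrix())
print(len(combos))
for combo in combos[:3]:
    print(combo)
```

Keeping the baseline out of the matrix makes each entry a distinct deviation from the failing configuration, so a run that starts cleanly can be attributed to a specific set of settings.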
Expected behaviour
TP=2 on dual Arc B50 Pro should initialize and serve as it does on 1.3.
Actual behaviour
TP=2 crashes during memory profiling with UR_RESULT_ERROR_DEVICE_LOST; TP=1 works on the same image.