intel/llm-scaler-vllm:1.4 — TP=2 fails with UR_RESULT_ERROR_DEVICE_LOST during determine_available_memory on dual Arc B50 Pro (regression vs 1.3)

<html><body>
<html><head></head><body>
<p>After upgrading from <code>intel/llm-scaler-vllm:1.3</code> to <code>intel/llm-scaler-vllm:1.4</code> (<code>0.14.0-b8.2.1</code>) with <strong>no other changes</strong>, a previously-working dual-GPU tensor-parallel (<code>-tp 2</code>) configuration on two Arc B50 Pro GPUs crashes at startup with <code>RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)</code>.</p>
<p>The crash occurs during the memory profiling step (<code>determine_available_memory</code> → <code>profile_run</code> → <code>_sync_device</code> → <code>torch.xpu.synchronize()</code>), immediately after model weights finish loading, on <strong>both</strong> TP workers simultaneously.</p>
<ul>
<li><code>1.3</code> with the <strong>identical</strong> command and compose: <strong>works.</strong></li>
<li><code>1.4</code> with <code>-tp 2</code>: <strong>fails</strong> (<code>DEVICE_LOST</code>).</li>
<li><code>1.4</code> with <code>-tp 1</code>: <strong>works</strong> (memory profiling completes without <code>DEVICE_LOST</code> and the engine reaches KV-cache sizing; see the TP=1 section below).</li>
</ul>
<p>This narrows the regression to the multi-GPU / inter-worker communication path (oneCCL / Level Zero IPC) as shipped in the <code>1.4</code> image, rather than the model, the single-GPU inference kernels, or the user configuration.</p>
<h2>Environment</h2>

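<table>
<thead><tr><th>Component</th><th>Value</th></tr></thead>
<tbody>
<tr><td>Image</td><td><code>intel/llm-scaler-vllm:1.4</code> (<code>0.14.0-b8.2.1</code>)</td></tr>
<tr><td>GPUs</td><td>2× Intel Arc B50 Pro (Battlemage / Xe2)</td></tr>
<tr><td>Host OS</td><td>Kubuntu 26.04</td></tr>
<tr><td>Model</td><td>Qwen3-14B, <code>--quantization sym_int4</code>, <code>--dtype float16</code></td></tr>
<tr><td>TP</td><td><code>--tensor-parallel-size 2</code></td></tr>
<tr><td>Runtime</td><td>Docker Compose, <code>ipc: host</code>, <code>shm_size: 32g</code>, <code>apparmor=unconfined</code>, <code>/dev/dri</code> mapped</td></tr>
</tbody>
</table>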


<p>vLLM version reported in logs: <code>0.14.1.dev0+gb17039bcc.d20260430</code>.</p>
<h2>Steps to reproduce</h2>
<ol>
<li>Run the dual-B50 TP=2 configuration below on the <code>1.4</code> image.</li>
<li>The container loads the weights successfully, then crashes during memory profiling.</li>
</ol>
<p>Serving command:</p>
<pre><code>vllm serve /llm/models/Qwen3-14B \
  --served-model-name Qwen3-14B \
  --host 0.0.0.0 --port 8000 \
  --reasoning-parser deepseek_r1 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --tensor-parallel-size 2 \
  --dtype float16 --quantization sym_int4 \
  --enforce-eager --trust-remote-code --disable-sliding-window \
  --gpu-memory-utilization 0.85 \
  --max-model-len 24576 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 16
</code></pre>
<h2>Crash log (TP=2)</h2>
<p>Weights load fine, then both workers fail at the same point:</p>
<pre><code>(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] WorkerProc hit an exception.
(Worker_TP1 pid=686) ERROR [multiproc_executor.py:822] Traceback (most recent call last):
...
  File ".../vllm/v1/worker/xpu_worker.py", line 134, in _determine_available_memory_default
    self.model_runner.profile_run()
  File ".../vllm/v1/worker/gpu_model_runner.py", line 4759, in profile_run
    self._sync_device()
  File ".../vllm/v1/worker/xpu_model_runner.py", line 35, in _sync_device
    torch.xpu.synchronize()
  File ".../torch/xpu/__init__.py", line 451, in synchronize
    return torch._C._xpu_synchronize(device)
RuntimeError: level_zero backend failed with error: 20 (UR_RESULT_ERROR_DEVICE_LOST)
</code></pre>
<p><code>EngineCore</code> then fails to start (<code>determine_available_memory</code> → <code>collective_rpc</code> → <code>get_response</code>).</p>
<h3>Relevant oneCCL warnings preceding the crash</h3>
<p>The IPC exchange mode requested via <code>CCL_ZE_IPC_EXCHANGE=sockets</code> does <strong>not</strong> appear to take effect. oneCCL acknowledges the setting, then reports a fallback to <code>drmfd</code>:</p>
<pre><code>|CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
|CCL_WARN| topology recognition shows PCIe connection between devices...
|CCL_WARN| pidfd is not supported, fallbacks to drmfd exchange mode
</code></pre>
<p>This <code>drmfd</code> fallback is suspected to be related to the failure, because the exchange mode governs how GPU memory handles are passed between the two TP worker processes.</p>
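<p>To compare runs across the different workaround settings, it may help to pull the requested and effective exchange modes out of each log automatically. The following is a minimal sketch, not part of vLLM or oneCCL: the function name <code>ipc_exchange_summary</code> is hypothetical, and the regular expressions are copied from the three warnings quoted above, so other oneCCL versions may word them differently.</p>

```python
import re

# Patterns copied from the oneCCL warnings quoted in this issue.
# Assumption: other oneCCL builds may phrase these messages differently.
REQUESTED = re.compile(r"value of CCL_ZE_IPC_EXCHANGE changed to be (\w+)")
FALLBACK = re.compile(r"(\w+) is not supported, fallbacks to (\w+) exchange mode")


def ipc_exchange_summary(log_text):
    """Report the requested mode, the rejected mode, and the mode finally used."""
    requested = rejected = effective = None
    for line in log_text.splitlines():
        match = REQUESTED.search(line)
        if match:
            # Until a fallback is reported, the requested mode is assumed active.
            requested = effective = match.group(1)
        match = FALLBACK.search(line)
        if match:
            rejected, effective = match.group(1), match.group(2)
    return {"requested": requested, "rejected": rejected, "effective": effective}


sample = """\
|CCL_WARN| value of CCL_ZE_IPC_EXCHANGE changed to be sockets (default:pidfd)
|CCL_WARN| topology recognition shows PCIe connection between devices...
|CCL_WARN| pidfd is not supported, fallbacks to drmfd exchange mode
"""
summary = ipc_exchange_summary(sample)
print(summary)
```

<p>On the excerpt above, the summary shows <code>sockets</code> as requested, <code>pidfd</code> as the mode reported unsupported, and <code>drmfd</code> as the effective mode. The fallback warning names <code>pidfd</code> rather than <code>sockets</code>, which may be worth noting when triaging.</p>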
<h2>TP=1 works (isolation)</h2>
<p>With <code>--tensor-parallel-size 1</code> (single B50) on the <strong>same 1.4 image</strong>, there is no <code>DEVICE_LOST</code>. The engine completes the full profiling run and fails only at the expected KV-cache sizing check (a capacity issue, not a fault):</p>
<pre><code>[Memory Profiling Analysis]
  &gt; Peak Allocated (Real Need)  : 11.63 GB
  &gt; Model memory usage          : 9.25 GB
  &gt; Current Reserved (Footprint): 11.64 GB
  &gt; Fragmentation (Wasted)      : 0.01 GB
ValueError: To serve at least one request with the model's max seq len (24576),
(3.75 GiB KV cache is needed, which is larger than the available KV cache memory (1.9 GiB).
Based on the available memory, the estimated maximum model length is 12480.
</code></pre>
<p>Reducing <code>--max-model-len</code> then lets TP=1 start and serve normally. <strong>So <code>1.4</code> runs fine on a single B50 — the failure is specific to the multi-GPU TP path.</strong></p>
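<p>The 3.75 GiB figure in the error can be cross-checked with a short calculation, which supports reading it as a plain capacity limit. This is a sketch under stated assumptions: the attention shape (40 layers, 8 KV heads, head dimension 128) is taken from the published Qwen3-14B model config rather than from this issue, the KV cache is assumed to be stored in float16, and the helper names are hypothetical.</p>

```python
# Assumed Qwen3-14B attention shape (from the published model config,
# not from this issue): 40 layers, 8 KV heads, head dimension 128.
NUM_LAYERS = 40
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEMENT = 2  # assumes a float16 KV cache, matching --dtype float16
GIB = 1024 ** 3


def kv_bytes_per_token():
    # The factor 2 covers the separate key and value tensors in every layer.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEMENT


def kv_gib_for_context(tokens):
    """KV-cache size in GiB for a single sequence of the given length."""
    return tokens * kv_bytes_per_token() / GIB


def max_tokens_for_budget(budget_gib):
    """Largest sequence length whose KV cache fits in the given budget."""
    return int(budget_gib * GIB // kv_bytes_per_token())


needed = kv_gib_for_context(24576)
fits = max_tokens_for_budget(1.9)
print(f"KV cache for 24576 tokens: {needed:.2f} GiB")
print(f"Tokens that fit in 1.9 GiB: {fits}")
```

<p>Under these assumptions the first value comes out at 3.75 GiB, matching the log. The second is slightly below the 12480 that vLLM reports, which is plausibly explained by the log rounding the available memory to 1.9 GiB.</p>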
<h2>Workarounds attempted (none resolved the TP=2 failure)</h2>
<p>All of the following were tried on the <code>1.4</code> image with <code>-tp 2</code>; every one produced the identical <code>UR_RESULT_ERROR_DEVICE_LOST</code> at the same location:</p>
<ul>
<li><code>cap_add: SYS_PTRACE</code></li>
<li><code>pid: host</code> (host PID namespace, in addition to <code>ipc: host</code> already set)</li>
<li><code>CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0</code></li>
<li><code>CCL_ZE_IPC_EXCHANGE=sockets</code></li>
<li><code>CCL_ZE_IPC_EXCHANGE=drmfd</code> (explicit)</li>
<li>Combinations of the above</li>
</ul>
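<p>For anyone re-testing, the settings above can be enumerated as an explicit matrix so that each combination is recorded once. This is only an illustrative sketch: it does not claim that every combination listed here was run for this report, the function names are hypothetical, and the two <code>CCL_ZE_IPC_EXCHANGE</code> values are treated as alternatives of one variable because they cannot be set together.</p>

```python
from itertools import chain, combinations, product

# Compose-level options and environment variables taken from the list above.
COMPOSE_OPTIONS = ["cap_add: SYS_PTRACE", "pid: host"]
ENV_OPTIONS = {
    # None means "leave the variable unset".
    "CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK": [None, "0"],
    "CCL_ZE_IPC_EXCHANGE": [None, "sockets", "drmfd"],
}


def subsets(items):
    """Every subset of items, from the empty set up to the full list."""
    return chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )


def workaround_matrix():
    """Cross every compose subset with every environment-variable choice."""
    names = list(ENV_OPTIONS)
    runs = []
    for compose in subsets(COMPOSE_OPTIONS):
        for values in product(*(ENV_OPTIONS[name] for name in names)):
            env = {n: v for n, v in zip(names, values) if v is not None}
            runs.append({"compose": list(compose), "env": env})
    return runs


matrix = workaround_matrix()
print(len(matrix), "configurations")
for run in matrix[:3]:
    print(run)
```

<p>The first entry is the unmodified baseline (no extra compose options, no variables set), which gives a reference point for the remaining runs.</p>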
<h2>Expected behaviour</h2>
<p>TP=2 on dual Arc B50 Pro should initialize and serve as it does on <code>1.3</code>.</p>
<h2>Actual behaviour</h2>
<p>TP=2 crashes during memory profiling with <code>UR_RESULT_ERROR_DEVICE_LOST</code>; TP=1 works on the same image.</p></body></html>
</body>
</html>

intel/llm-scaler-vllm:1.4 — TP=2 fails with UR_RESULT_ERROR_DEVICE_LOST during determine_available_memory on dual Arc B50 Pro (regression vs 1.3) #424
