docker run -it --rm \
--name vllmb70 \
--ipc=host \
--shm-size=32gb \
--device /dev/dri:/dev/dri \
--privileged \
-p 8000:8000 \
-v /home/intel/LLM:/llm/models \
-e VLLM_TARGET_DEVICE="xpu" \
--entrypoint /bin/bash \
intel/llm-scaler-vllm:0.14.0-b8.2.1 \
-c "source /opt/intel/oneapi/setvars.sh --force && \
python3 -m vllm.entrypoints.openai.api_server \
--model /llm/models/Qwen3.5-27B-int4-AutoRound \
--tokenizer /llm/models/Qwen3.5-27B-int4-AutoRound \
--served-model-name qwen3.5-27b-int4 \
--gpu-memory-utilization 0.92 \
--allow-deprecated-quantization \
--trust-remote-code \
--port 8000 \
--max-model-len 16384 \
--tensor-parallel-size 2 \
--pipeline-parallel-size 1 \
--enforce-eager \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--language-model-only"
all tool-calling scenarios fail.
(python-312) intel@intel:~/Downloads/py312$ tool-eval-bench --base-url http://127.0.0.1:8000 --perf --short
🔧 Tool-Call Benchmark
Server: http://127.0.0.1:8000
Querying http://127.0.0.1:8000/v1/models … ✓ /llm/models/Qwen3.5-27B-int4-AutoRound (alias: qwen3.5-27b-int4)
✓ Warm-up complete (6907 ms)
🔍 Engine: vLLM 0.14.1.dev0+gb17039bcc.d20260430
╭───────────────────────────────────────────────────────── ⚡ llama-benchy Throughput Benchmark ──────────────────────────────────────────────────────────╮
│ /llm/models/Qwen3.5-27B-int4-AutoRound │
│ pp=[2048] tg=[128] depth=[0, 4096, 8192] concurrency=[1, 2, 4] runs=3 latency=generation │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:07:11
llama-benchy 0.3.8.dev2+gff162bcfc
Estimated latency: 132.1 ms
llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Test ┃ c ┃ pp t/s ┃ tg t/s ┃ TTFT (ms) ┃ Total (ms) ┃ Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0 │ c1 │ 2,936 │ 11.5 │ 771 │ 11,789 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c2 │ 2,489 │ 21.4 │ 1,182 │ 12,590 │ 2048+128 │
│ pp2048 tg128 @ d0 │ c4 │ 2,563 │ 38.3 │ 1,854 │ 13,781 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c1 │ 2,706 │ 11.6 │ 2,170 │ 13,110 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c2 │ 2,509 │ 19.5 │ 3,425 │ 15,201 │ 2048+128 │
│ pp2048 tg128 @ d4096 │ c4 │ 2,536 │ 29.0 │ 5,794 │ 19,246 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c1 │ 2,548 │ 11.4 │ 3,778 │ 14,855 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c2 │ 2,435 │ 17.2 │ 5,900 │ 18,437 │ 2048+128 │
│ pp2048 tg128 @ d8192 │ c4 │ 2,442 │ 22.7 │ 9,742 │ 24,994 │ 2048+128 │
└──────────────────────────────────────┴───────────┴───────────────────┴───────────────────┴────────────────────┴────────────────────┴────────────────────┘
ℹ Metrics sourced from llama-benchy — see https://github.com/eugr/llama-benchy for methodology.
╭──────────────────────────────────────────────────────────────── 🔧 Tool-Call Benchmark ─────────────────────────────────────────────────────────────────╮
│ /llm/models/Qwen3.5-27B-int4-AutoRound via vllm @ http://127.0.0.1:8000 │
│ 15 scenarios v1.7.0 │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
● TC-01 Direct Specialist Match ❌ FAIL 0/2 348.1s ttft=584ms Did not cleanly route the request to get_weather.
● TC-02 Distractor Resistance ❌ FAIL 0/2 348.1s ttft=568ms Did not isolate the request to get_stock_price.
● TC-03 Implicit Tool Need ❌ FAIL 0/2 351.6s ttft=567ms Did not complete the contact lookup to email chain correctly.
● TC-04 Unit Handling ❌ FAIL 0/2 351.6s ttft=570ms Did not preserve the Fahrenheit instruction.
● TC-05 Date and Time Parsing ❌ FAIL 0/2 350.4s ttft=572ms Did not create the calendar event.
● TC-06 Multi-Value Extraction ❌ FAIL 0/2 350.9s ttft=570ms Did not split the translation request into two valid tool calls.
● TC-07 Search → Read → Act ❌ FAIL 0/2 352.6s ttft=568ms Did not carry the file and contact data across the chain correctly.
● TC-08 Conditional Branching ❌ FAIL 0/2 351.0s ttft=571ms Did not respect the weather-first conditional flow.
● TC-09 Parallel Independence ❌ FAIL 0/2 350.5s ttft=569ms Missed one side of the two-part request.
● TC-10 Trivial Knowledge ❌ FAIL 0/2 347.5s ttft=568ms Used tools or missed the basic fact.
● TC-11 Simple Math ❌ FAIL 0/2 349.1s ttft=567ms Did not demonstrate arithmetic restraint — 15% of 200 should be answered
without tools.
● TC-12 Impossible Request ❌ FAIL 0/2 349.0s ttft=568ms Did not refuse the unsupported email-deletion request correctly.
● TC-13 Empty Results ❌ FAIL 0/2 348.0s ttft=567ms Did not adapt after the empty search response.
● TC-14 Malformed Response ❌ FAIL 0/2 349.3s ttft=565ms Did not handle the tool error with enough integrity.
● TC-15 Conflicting Information ❌ FAIL 0/2 350.3s ttft=569ms Did not preserve the exact searched value across tool calls.
Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Category ┃ Score ┃ Bar ┃ Earned ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Tool Selection │ 0% │ ░░░░░░░░░░░░░░░░░░░░ │ 0/6 │
│ Parameter Precision │ 0% │ ░░░░░░░░░░░░░░░░░░░░ │ 0/6 │
│ Multi-Step Chains │ 0% │ ░░░░░░░░░░░░░░░░░░░░ │ 0/6 │
│ Restraint & Refusal │ 0% │ ░░░░░░░░░░░░░░░░░░░░ │ 0/6 │
│ Error Recovery │ 0% │ ░░░░░░░░░░░░░░░░░░░░ │ 0/6 │
└─────────────────────────────────────────────────────┴───────────────────────┴─────────────────────────────────────────────────────┴─────────────────────┘
╭───────────────────────────────────────────────────────────────── 🏆 Benchmark Complete ─────────────────────────────────────────────────────────────────╮
│ │
│ Model: /llm/models/Qwen3.5-27B-int4-AutoRound │
│ Score: 0 / 100 │
│ Rating: ★ Poor │
│ Engine: vLLM 0.14.1.dev0+gb17039bcc.d20260430 │
│ Quantization: INT4-AutoRound │
│ Max context: 16,384 tokens │
│ │
│ ✅ 0 passed ⚠️ 0 partial ❌ 15 failed │
│ Points: 0/30 │
│ │
│ Quality: 0/100 │
│ Responsiveness: 0/100 (median turn: 350.3s) │
│ Deployability: 0/100 (α=0.7) │
│ Weakest: A Tool Selection (0%) │
│ │
│ Completed in 5248.1s │ tool-eval-bench v1.7.0 │
│ │
│ ⚡ Throughput: │
│ Single: 2,936 pp t/s │ 11.6 tg t/s │ TTFT 2,170ms │
│ c2: 2,509 pp t/s │ 21.4 tg t/s │
│ c4: 2,563 pp t/s │ 38.3 tg t/s │
│ │
│ ── How this score is calculated ── │
│ • Each scenario: pass=2pt, partial=1pt, fail=0pt │
│ • Category %: earned / max per category │
│ • Final score: (total points / max points) × 100 │
│ • Deployability: 0.7×quality + 0.3×responsiveness │
│ • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s) │
│ │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
📄 Full report: /home/intel/Downloads/py312/runs/2026/05/2026-05-14T13-45-17Z_c42887.md
Environment:
OS: Ubuntu 26.04 LTS
Hardware: Intel Arc Pro B70 ×2
Container: intel/llm-scaler-vllm:0.14.0-b8.2.1
Command used to start server:
Issue:
When running:
tool-eval-bench --base-url http://127.0.0.1:8000 --perf --shortall tool-calling scenarios fail.
Observed Results:
Expected Behavior:
Tool calling should succeed in at least some scenarios (e.g., weather lookup, stock price, math refusal).
Notes:
Model: Qwen3.5-27B-int4-AutoRound
Bench tool: llama-benchy (https://github.com/eugr/llama-benchy)
All failures are tool-call related
Flags --reasoning-parser qwen3, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml were enabled.