Skip to content

Qwen3.5 27b int4 autoaround Tool Calling Benchmark Fails on Intel Arc Pro B70 ×2 with vLLM 0.14.0-b8.2.1 #412

@edisonchan

Description

@edisonchan

Environment:
OS: Ubuntu 26.04 LTS
Hardware: Intel Arc Pro B70 ×2
Container: intel/llm-scaler-vllm:0.14.0-b8.2.1
Command used to start server:

docker run -it --rm \
  --name vllmb70 \
  --ipc=host \
  --shm-size=32gb \
  --device /dev/dri:/dev/dri \
  --privileged \
  -p 8000:8000 \
  -v /home/intel/LLM:/llm/models \
  -e VLLM_TARGET_DEVICE="xpu" \
  --entrypoint /bin/bash \
  intel/llm-scaler-vllm:0.14.0-b8.2.1 \
  -c "source /opt/intel/oneapi/setvars.sh --force && \
      python3 -m vllm.entrypoints.openai.api_server \
        --model /llm/models/Qwen3.5-27B-int4-AutoRound \
        --tokenizer /llm/models/Qwen3.5-27B-int4-AutoRound \
        --served-model-name qwen3.5-27b-int4 \
        --gpu-memory-utilization 0.92 \
        --allow-deprecated-quantization \
        --trust-remote-code \
        --port 8000 \
        --max-model-len 16384 \
        --tensor-parallel-size 2 \
        --pipeline-parallel-size 1 \
        --enforce-eager \
        --reasoning-parser qwen3 \
        --enable-auto-tool-choice \
        --tool-call-parser qwen3_xml \
        --language-model-only"

Issue:
When running:

tool-eval-bench --base-url http://127.0.0.1:8000 --perf --short

all tool-calling scenarios fail.

Observed Results:

(python-312) intel@intel:~/Downloads/py312$ tool-eval-bench --base-url http://127.0.0.1:8000 --perf --short

🔧 Tool-Call Benchmark
  Server: http://127.0.0.1:8000
  Querying http://127.0.0.1:8000/v1/models … ✓ /llm/models/Qwen3.5-27B-int4-AutoRound (alias: qwen3.5-27b-int4)

  ✓ Warm-up complete (6907 ms)
  🔍 Engine: vLLM 0.14.1.dev0+gb17039bcc.d20260430

╭───────────────────────────────────────────────────────── ⚡ llama-benchy Throughput Benchmark ──────────────────────────────────────────────────────────╮
│ /llm/models/Qwen3.5-27B-int4-AutoRound                                                                                                                  │
│ pp=[2048]  tg=[128]  depth=[0, 4096, 8192]  concurrency=[1, 2, 4]  runs=3  latency=generation                                                           │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ✓ Complete ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 27/27 0:07:11

  llama-benchy 0.3.8.dev2+gff162bcfc
  Estimated latency: 132.1 ms

                                                                   llama-benchy Results
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
┃ Test                                 ┃     c     ┃            pp t/s ┃            tg t/s ┃          TTFT (ms) ┃         Total (ms) ┃             Tokens ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
│ pp2048 tg128 @ d0                    │    c1     │             2,936 │              11.5 │                771 │             11,789 │           2048+128 │
│ pp2048 tg128 @ d0                    │    c2     │             2,489 │              21.4 │              1,182 │             12,590 │           2048+128 │
│ pp2048 tg128 @ d0                    │    c4     │             2,563 │              38.3 │              1,854 │             13,781 │           2048+128 │
│ pp2048 tg128 @ d4096                 │    c1     │             2,706 │              11.6 │              2,170 │             13,110 │           2048+128 │
│ pp2048 tg128 @ d4096                 │    c2     │             2,509 │              19.5 │              3,425 │             15,201 │           2048+128 │
│ pp2048 tg128 @ d4096                 │    c4     │             2,536 │              29.0 │              5,794 │             19,246 │           2048+128 │
│ pp2048 tg128 @ d8192                 │    c1     │             2,548 │              11.4 │              3,778 │             14,855 │           2048+128 │
│ pp2048 tg128 @ d8192                 │    c2     │             2,435 │              17.2 │              5,900 │             18,437 │           2048+128 │
│ pp2048 tg128 @ d8192                 │    c4     │             2,442 │              22.7 │              9,742 │             24,994 │           2048+128 │
└──────────────────────────────────────┴───────────┴───────────────────┴───────────────────┴────────────────────┴────────────────────┴────────────────────┘

  ℹ Metrics sourced from llama-benchy — see https://github.com/eugr/llama-benchy for methodology.


╭──────────────────────────────────────────────────────────────── 🔧 Tool-Call Benchmark ─────────────────────────────────────────────────────────────────╮
│ /llm/models/Qwen3.5-27B-int4-AutoRound  via vllm @ http://127.0.0.1:8000                                                                                │
│ 15 scenarios  v1.7.0                                                                                                                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  ● TC-01  Direct Specialist Match         ❌ FAIL  0/2  348.1s  ttft=584ms  Did not cleanly route the request to get_weather.
  ● TC-02  Distractor Resistance           ❌ FAIL  0/2  348.1s  ttft=568ms  Did not isolate the request to get_stock_price.
  ● TC-03  Implicit Tool Need              ❌ FAIL  0/2  351.6s  ttft=567ms  Did not complete the contact lookup to email chain correctly.
  ● TC-04  Unit Handling                   ❌ FAIL  0/2  351.6s  ttft=570ms  Did not preserve the Fahrenheit instruction.
  ● TC-05  Date and Time Parsing           ❌ FAIL  0/2  350.4s  ttft=572ms  Did not create the calendar event.
  ● TC-06  Multi-Value Extraction          ❌ FAIL  0/2  350.9s  ttft=570ms  Did not split the translation request into two valid tool calls.
  ● TC-07  Search → Read → Act             ❌ FAIL  0/2  352.6s  ttft=568ms  Did not carry the file and contact data across the chain correctly.
  ● TC-08  Conditional Branching           ❌ FAIL  0/2  351.0s  ttft=571ms  Did not respect the weather-first conditional flow.
  ● TC-09  Parallel Independence           ❌ FAIL  0/2  350.5s  ttft=569ms  Missed one side of the two-part request.
  ● TC-10  Trivial Knowledge               ❌ FAIL  0/2  347.5s  ttft=568ms  Used tools or missed the basic fact.
  ● TC-11  Simple Math                     ❌ FAIL  0/2  349.1s  ttft=567ms  Did not demonstrate arithmetic restraint — 15% of 200 should be answered
without tools.
  ● TC-12  Impossible Request              ❌ FAIL  0/2  349.0s  ttft=568ms  Did not refuse the unsupported email-deletion request correctly.
  ● TC-13  Empty Results                   ❌ FAIL  0/2  348.0s  ttft=567ms  Did not adapt after the empty search response.
  ● TC-14  Malformed Response              ❌ FAIL  0/2  349.3s  ttft=565ms  Did not handle the tool error with enough integrity.
  ● TC-15  Conflicting Information         ❌ FAIL  0/2  350.3s  ttft=569ms  Did not preserve the exact searched value across tool calls.

                                                                    Category Breakdown
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Category                                            ┃         Score         ┃ Bar                                                 ┃       Earned        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ Tool Selection                                      │          0%           │ ░░░░░░░░░░░░░░░░░░░░                                │         0/6         │
│ Parameter Precision                                 │          0%           │ ░░░░░░░░░░░░░░░░░░░░                                │         0/6         │
│ Multi-Step Chains                                   │          0%           │ ░░░░░░░░░░░░░░░░░░░░                                │         0/6         │
│ Restraint & Refusal                                 │          0%           │ ░░░░░░░░░░░░░░░░░░░░                                │         0/6         │
│ Error Recovery                                      │          0%           │ ░░░░░░░░░░░░░░░░░░░░                                │         0/6         │
└─────────────────────────────────────────────────────┴───────────────────────┴─────────────────────────────────────────────────────┴─────────────────────┘

╭───────────────────────────────────────────────────────────────── 🏆 Benchmark Complete ─────────────────────────────────────────────────────────────────╮
│                                                                                                                                                         │
│    Model:  /llm/models/Qwen3.5-27B-int4-AutoRound                                                                                                       │
│    Score:  0 / 100                                                                                                                                      │
│    Rating: ★ Poor                                                                                                                                       │
│    Engine:       vLLM 0.14.1.dev0+gb17039bcc.d20260430                                                                                                  │
│    Quantization: INT4-AutoRound                                                                                                                         │
│    Max context:  16,384 tokens                                                                                                                          │
│                                                                                                                                                         │
│    ✅ 0 passed   ⚠️  0 partial   ❌ 15 failed                                                                                                           │
│    Points: 0/30                                                                                                                                         │
│                                                                                                                                                         │
│    Quality:        0/100                                                                                                                                │
│    Responsiveness: 0/100  (median turn: 350.3s)                                                                                                         │
│    Deployability:  0/100  (α=0.7)                                                                                                                       │
│    Weakest: A Tool Selection (0%)                                                                                                                       │
│                                                                                                                                                         │
│    Completed in 5248.1s  │  tool-eval-bench v1.7.0                                                                                                      │
│                                                                                                                                                         │
│    ⚡ Throughput:                                                                                                                                       │
│    Single:  2,936 pp t/s  │  11.6 tg t/s  │  TTFT 2,170ms                                                                                               │
│    c2:      2,509 pp t/s  │  21.4 tg t/s                                                                                                                │
│    c4:      2,563 pp t/s  │  38.3 tg t/s                                                                                                                │
│                                                                                                                                                         │
│    ── How this score is calculated ──                                                                                                                   │
│    • Each scenario: pass=2pt, partial=1pt, fail=0pt                                                                                                     │
│    • Category %: earned / max per category                                                                                                              │
│    • Final score: (total points / max points) × 100                                                                                                     │
│    • Deployability: 0.7×quality + 0.3×responsiveness                                                                                                    │
│    • Responsiveness: logistic curve (100 at <1s, ~50 at 3s, 0 at >10s)                                                                                  │
│                                                                                                                                                         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

  📄 Full report: /home/intel/Downloads/py312/runs/2026/05/2026-05-14T13-45-17Z_c42887.md

Expected Behavior:
Tool calling should succeed in at least some scenarios (e.g., weather lookup, stock price, math refusal).
Notes:
Model: Qwen3.5-27B-int4-AutoRound
Bench tool: llama-benchy (https://github.com/eugr/llama-benchy)
All failures are tool-call related
Flags --reasoning-parser qwen3, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml were enabled.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions