Summary
When loading Qwen3.5 AWQ offline-quantized models with the llm-scaler-vllm image on PTL 358H, the memory footprint of the 9B model matches the FP8 baseline (no memory savings from AWQ), and the 35B-A3B model fails to load with an out-of-memory (OOM) error.
Environment
- Platform: PTL 358H (16 cores); iGPU: B390 Graphics, 12 Xe Cores, 122 TOPS
- Docker image: intel/llm-scaler-vllm:0.14.0-b7.1
- HF models:
  - QuantTrio/Qwen3.5-9B-AWQ
  - QuantTrio/Qwen3.5-35B-A3B-AWQ
Reproduction
Entrypoint script:
TORCH_LLM_ALLREDUCE=1 \
VLLM_USE_V1=1 \
CCL_ZE_IPC_EXCHANGE=pidfd \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python3 -m vllm.entrypoints.openai.api_server \
--model ${MODEL_PATH} \
--served-model-name ${SERVED_MODEL_NAME} \
--enforce-eager \
--port 8000 \
--host 0.0.0.0 \
--trust-remote-code \
--gpu-memory-util=0.6 \
--max-num-batched-tokens=8192 \
--disable-log-requests \
--max-model-len=${MAX_MODEL_LEN} \
--block-size 64 \
--quantization awq \
-tp=1 \
--enable_prefix_caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--allow-deprecated-quantization ipex_awq
Observed behavior
- QuantTrio/Qwen3.5-9B-AWQ: memory usage is roughly equal to that of the FP8 variant, i.e. AWQ appears to bring no memory reduction. See the attached log: load_qwen3.5_9b_awq.log.
- QuantTrio/Qwen3.5-35B-A3B-AWQ: OOM on load.
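When comparing footprints, it may help to separate weight memory from the KV-cache pool. vLLM sizes the cache from whatever remains of the `--gpu-memory-util` budget after the weights are loaded, so the total process footprint can look similar even when the weights differ in size. The sketch below is a simplified model of that split, not a description of what this image actually does. The 32 GiB total, the 1 GiB overhead, and the weight sizes are hypothetical placeholders; the real figures should come from the attached log.

```python
def memory_split(total_gib, gpu_memory_utilization, weights_gib, overhead_gib=1.0):
    """Simplified model: budget = total * utilization; the KV cache gets the rest.

    All inputs are in GiB. overhead_gib stands in for activations and runtime
    buffers and is a hypothetical placeholder, not a measured value.
    """
    budget = total_gib * gpu_memory_utilization
    kv_cache = budget - weights_gib - overhead_gib
    return {
        "budget_gib": budget,
        "kv_cache_gib": max(0.0, kv_cache),
        # If nothing is left for the cache, the load is expected to fail.
        "fits": kv_cache > 0,
    }


if __name__ == "__main__":
    # Hypothetical 32 GiB of memory visible to the iGPU, utilization 0.6 as in the script.
    for label, weights in [("FP8-sized weights", 9.0), ("4-bit-sized weights", 5.0)]:
        print(label, memory_split(32.0, 0.6, weights))
```

Under this model both runs reserve the same overall budget and only the weights/KV-cache split changes, so the weight-loading lines in the log are a better indicator of AWQ savings than the total footprint.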
Expected behavior
AWQ-quantized weights should occupy significantly less memory than FP8 weights (roughly half, assuming 4-bit AWQ), and the 35B-A3B AWQ model should load within the iGPU memory budget on this platform.
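As a rough sanity check on this expectation, the weight sizes can be estimated from the parameter count. The sketch below assumes nominal parameter counts of 9e9 and 35e9, 4-bit AWQ weights with a group size of 128, and one 16-bit scale plus one 4-bit zero point per group. These are common AWQ settings, not values confirmed for these checkpoints. It also ignores layers that are usually left unquantized (e.g. embeddings), so real AWQ footprints will be somewhat higher.

```python
GIB = 1024 ** 3


def awq_bits_per_weight(weight_bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight, including per-group scale and zero point.

    The defaults are assumed typical AWQ settings, not read from the model config.
    """
    return weight_bits + (scale_bits + zero_bits) / group_size


def weight_gib(num_params, bits_per_weight):
    """Approximate weight memory in GiB for a given storage width."""
    return num_params * bits_per_weight / 8 / GIB


if __name__ == "__main__":
    # Parameter counts are nominal values taken from the model names.
    for name, params in [("Qwen3.5-9B", 9e9), ("Qwen3.5-35B-A3B", 35e9)]:
        fp8 = weight_gib(params, 8)
        awq = weight_gib(params, awq_bits_per_weight())
        print(f"{name}: FP8 ~{fp8:.1f} GiB, AWQ ~{awq:.1f} GiB")
```

For the MoE model the total parameter count is used rather than the 3B active parameters, since all experts have to be resident in memory.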
Attachments
load_qwen3.5_9b_awq.log