AWQ quantization models have abnormal memory usage / OOM on PTL 358H #402

@Johere

Summary

When loading Qwen3.5 AWQ offline-quantized models with the llm-scaler-vllm image on PTL 358H, the memory footprint of the 9B model matches the FP8 baseline (no memory savings from AWQ), and the 35B-A3B model OOMs.

Environment

  • Platform: PTL 358H (16 cores); iGPU: B390 Graphics, 12 Xe Cores, 122 TOPS
  • Docker image: intel/llm-scaler-vllm:0.14.0-b7.1
  • HF models:
    • QuantTrio/Qwen3.5-9B-AWQ
    • QuantTrio/Qwen3.5-35B-A3B-AWQ

Reproduction

Entrypoint script:

TORCH_LLM_ALLREDUCE=1 \
VLLM_USE_V1=1 \
CCL_ZE_IPC_EXCHANGE=pidfd \
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
python3 -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --served-model-name ${SERVED_MODEL_NAME} \
    --enforce-eager \
    --port 8000 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --gpu-memory-util=0.6 \
    --max-num-batched-tokens=8192 \
    --disable-log-requests \
    --max-model-len=${MAX_MODEL_LEN} \
    --block-size 64 \
    --quantization awq \
    -tp=1 \
    --enable_prefix_caching \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --allow-deprecated-quantization ipex_awq

Observed behavior

  • QuantTrio/Qwen3.5-9B-AWQ: memory usage is roughly equivalent to the FP8 variant, i.e. AWQ appears to bring no memory reduction. See attached log: load_qwen3.5_9b_awq.log.
  • QuantTrio/Qwen3.5-35B-A3B-AWQ: OOM on load.
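
To help narrow this down, the on-disk checkpoint size and declared quantization settings can be compared against the memory figures in the load log. The following is only a minimal sketch and not part of the original reproduction: `checkpoint_summary` is a hypothetical helper, and the `quantization_config` keys it reads (`quant_method`, `bits`, `group_size`) follow the common Hugging Face convention for AWQ checkpoints, which these models are assumed, not confirmed, to follow.

```python
import json
from pathlib import Path


def checkpoint_summary(model_dir):
    """Return on-disk weight size and declared quantization settings.

    Hypothetical helper for illustration; it only reads local files.
    """
    root = Path(model_dir)
    # Sum the sizes of all safetensors shards in the checkpoint directory.
    weight_bytes = sum(p.stat().st_size for p in root.glob("*.safetensors"))

    config_path = root / "config.json"
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    # Assumption: quantization settings live under "quantization_config",
    # either at the top level or nested under "text_config".
    quant = (
        config.get("quantization_config")
        or config.get("text_config", {}).get("quantization_config")
        or {}
    )

    return {
        "weight_gib": weight_bytes / 2**30,
        "quant_method": quant.get("quant_method"),
        "bits": quant.get("bits"),
        "group_size": quant.get("group_size"),
    }
```

Calling `checkpoint_summary(MODEL_PATH)` for both the AWQ and FP8 checkpoints gives a baseline. If the AWQ checkpoint is clearly smaller on disk but the loaded footprint is not, that would point at the load path (for example, weights being dequantized or repacked) rather than at the checkpoint itself.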

Expected behavior

AWQ-quantized weights should occupy significantly less memory than the FP8 weights, and the 35B-A3B AWQ model should load within the iGPU memory budget on this platform.
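
For a rough sense of the expected gap, the weight-only footprint can be estimated from the parameter count and bits per weight. This is a back-of-the-envelope sketch under stated assumptions, not a measurement: the parameter counts are the nominal 9B and 35B from the model names, and the AWQ layout (4-bit weights, group size 128, one fp16 scale plus one 4-bit zero point per group) is an assumed typical configuration rather than one confirmed for these checkpoints. `estimate_weight_gib` is a hypothetical helper.

```python
def estimate_weight_gib(num_params, bits_per_weight, group_size=None,
                        overhead_bits_per_group=0):
    """Rough weight-only footprint in GiB (no KV cache, no activations)."""
    bits = num_params * bits_per_weight
    if group_size:
        # Per-group metadata such as scales and zero points.
        bits += (num_params / group_size) * overhead_bits_per_group
    return bits / 8 / 2**30


# Nominal parameter counts taken from the model names (assumption).
for name, params in [("9B", 9e9), ("35B-A3B", 35e9)]:
    fp8 = estimate_weight_gib(params, 8)
    # Assumed AWQ layout: 4-bit weights, group size 128, one fp16 scale
    # plus one 4-bit zero point per group (20 extra bits per group).
    awq = estimate_weight_gib(params, 4, group_size=128,
                              overhead_bits_per_group=20)
    print(f"{name}: FP8 ~{fp8:.1f} GiB, AWQ 4-bit ~{awq:.1f} GiB")
```

Under these assumptions the AWQ weights come out at a little over half the FP8 size. Real checkpoints usually keep some layers (for example embeddings) unquantized, so the actual ratio will be somewhat higher. For the 35B-A3B MoE model, all experts have to be resident, so the full parameter count applies even though only a fraction is active per token.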

Attachments

load_qwen3.5_9b_awq.log
