Docker image intel/llm-scaler-vllm:0.14.0-b8.2.1 does not support Qwen3.5-35B-A3B-GPTQ-Int4 and Qwen3.5-35B-A3B-FP8 #415

 
Docker Image: intel/llm-scaler-vllm:0.14.0-b8.2.1


docker run -t -d --rm \
  --shm-size 32g \
  --net=host \
  --ipc=host \
  --privileged \
  -e no_proxy="localhost,127.0.0.1" \
  -e NO_PROXY="localhost,127.0.0.1" \
  --cap-add=SYS_PTRACE \
  --cap-add=SYS_ADMIN \
  --security-opt seccomp=unconfined \
  -v /dev/dri/by-path:/dev/dri/by-path \
  --name=vllm-test \
  --device /dev/dri:/dev/dri \
  -v /home/intel/vllm/models:/workspace/vllm/models \
  --entrypoint= \
  intel/llm-scaler-vllm:0.14.0-b8.2.1 \
  /bin/bash




root@intel-NucBox-EVO-T2S:~# VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
  --model /workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 1200 \
  --gpu-memory-utilization 0.8 \
  --enforce-eager \
  --block-size 64 \
  --trust-remote-code \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 2
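
A side note on the flags above: the log warns that `max_num_batched_tokens` (4096) exceeds `max_num_seqs * max_model_len` (2400). This is only a warning and is probably unrelated to the startup failure. A quick sketch of the arithmetic, where the function name `batched_token_budget` is hypothetical and not part of vLLM:

```python
def batched_token_budget(max_model_len: int, max_num_seqs: int,
                         max_num_batched_tokens: int) -> tuple[int, bool]:
    """Return (max_num_seqs * max_model_len, whether the batch budget exceeds it)."""
    # Upper bound on tokens that max_num_seqs full-length sequences can hold.
    limit = max_num_seqs * max_model_len
    return limit, max_num_batched_tokens > limit


# Values taken from the launch command in this report.
limit, exceeds = batched_token_budget(1200, 2, 4096)
print(f"limit={limit}, exceeds={exceeds}")
```

Lowering `--max-num-batched-tokens` to 2400 or less should silence the warning, although I would not expect it to change the `AssertionError`.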


**Qwen3.5-35B-A3B-FP8 Error**

[W515 11:06:55.635458156 OperatorEntry.cpp:208] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:36
       new kernel: registered at /root/workspace/frameworks.ai.pytorch.ipex-gpu/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
(APIServer pid=89300) INFO 05-15 11:06:58 [api_server.py:1272] vLLM API server version 0.14.1.dev0+gb17039bcc.d20260430
(APIServer pid=89300) INFO 05-15 11:06:58 [utils.py:263] non-default args: {'model': '/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8', 'trust_remote_code': True, 'max_model_len': 1200, 'enforce_eager': True, 'block_size': 64, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'max_num_seqs': 2}
(APIServer pid=89300) [transformers] The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=89300) [transformers] The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=89300) INFO 05-15 11:06:58 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=89300) INFO 05-15 11:06:58 [model.py:1549] Using max model len 1200
(APIServer pid=89300) INFO 05-15 11:06:58 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=89300) WARNING 05-15 11:06:58 [_logger.py:68] max_num_batched_tokens (4096) exceeds max_num_seqs * max_model_len (2400). This may lead to unexpected behavior.
(APIServer pid=89300) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=89300) INFO 05-15 11:06:58 [config.py:479] Setting attention block size to 576 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=89300) INFO 05-15 11:06:58 [config.py:503] Padding mamba page size by 7.46% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=89300) INFO 05-15 11:06:58 [vllm.py:636] Asynchronous scheduling is disabled.
(APIServer pid=89300) WARNING 05-15 11:06:58 [_logger.py:68] Enforce eager set, overriding optimization level to -O0
[W515 11:07:01.159592070 OperatorEntry.cpp:208] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: XPU
  previous kernel: registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:36
       new kernel: registered at /root/workspace/frameworks.ai.pytorch.ipex-gpu/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:04 [core.py:97] Initializing a V1 LLM engine (v0.14.1.dev0+gb17039bcc.d20260430) with config: model='/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8', speculative_config=None, tokenizer='/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1200, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 
'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=89431) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:04 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.67.108.172:43463 backend=xccl
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:04 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
2026:05:15-11:07:04:(89431) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:05:15-11:07:04:(89431) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:05:15-11:07:05:(89431) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance
2026:05:15-11:07:05:(89431) |CCL_WARN| Applying sycl-kernels, but device family is not recognized
(EngineCore_DP0 pid=89431) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:07 [gpu_model_runner.py:3811] Starting to load model /workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8...
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [xpu.py:106] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [xpu.py:103] Using backend AttentionBackendEnum.IPEX for vit attention
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [mm_encoder_attention.py:89] Using AttentionBackendEnum.IPEX for MMEncoderAttention.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [fp8.py:126] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [fp8.py:149] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     super().__init__(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self._init_executor()
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.driver_worker.load_model()
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3830, in load_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     model = initialize_model(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]             ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 50, in initialize_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 1143, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.language_model = Qwen3_5MoeForCausalLM(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                           ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 890, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     super().__init__(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 832, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.model = Qwen3_5Model(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                  ^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     old_init(self, **kwargs)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 592, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 586, in get_layer
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     return Qwen3_5DecoderLayer(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 511, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.mlp = Qwen3NextSparseMoeBlock(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 170, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.experts = SharedFusedMoE(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/shared_fused_moe.py", line 28, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     super().__init__(**kwargs)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 620, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     self.quant_method: FusedMoEMethodBase = _get_quant_method()
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                                             ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 612, in _get_quant_method
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     quant_method = self.quant_config.get_quant_method(self, prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 222, in get_quant_method
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     return self.get_xpu_quant_method(layer, prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 213, in get_xpu_quant_method
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     return XPUFp8MoEMethod(fp8_config, layer)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/ipex_quant.py", line 438, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     super().__init__(quant_config, layer)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 1097, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]     assert not quant_config.is_checkpoint_fp8_serialized
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] AssertionError
(EngineCore_DP0 pid=89431) Process EngineCore_DP0:
(EngineCore_DP0 pid=89431) Traceback (most recent call last):
(EngineCore_DP0 pid=89431)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=89431)     self.run()
(EngineCore_DP0 pid=89431)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=89431)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=89431)     raise e
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=89431)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=89431)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=89431)     super().__init__(
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=89431)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=89431)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=89431)     self._init_executor()
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=89431)     self.driver_worker.load_model()
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=89431)     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3830, in load_model
(EngineCore_DP0 pid=89431)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=89431)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=89431)     model = initialize_model(
(EngineCore_DP0 pid=89431)             ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 50, in initialize_model
(EngineCore_DP0 pid=89431)     return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 1143, in __init__
(EngineCore_DP0 pid=89431)     self.language_model = Qwen3_5MoeForCausalLM(
(EngineCore_DP0 pid=89431)                           ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 890, in __init__
(EngineCore_DP0 pid=89431)     super().__init__(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 832, in __init__
(EngineCore_DP0 pid=89431)     self.model = Qwen3_5Model(
(EngineCore_DP0 pid=89431)                  ^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=89431)     old_init(self, **kwargs)
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 592, in __init__
(EngineCore_DP0 pid=89431)     self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=89431)                                                     ^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=89431)     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=89431)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 586, in get_layer
(EngineCore_DP0 pid=89431)     return Qwen3_5DecoderLayer(
(EngineCore_DP0 pid=89431)            ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 511, in __init__
(EngineCore_DP0 pid=89431)     self.mlp = Qwen3NextSparseMoeBlock(
(EngineCore_DP0 pid=89431)                ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 170, in __init__
(EngineCore_DP0 pid=89431)     self.experts = SharedFusedMoE(
(EngineCore_DP0 pid=89431)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/shared_fused_moe.py", line 28, in __init__
(EngineCore_DP0 pid=89431)     super().__init__(**kwargs)
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 620, in __init__
(EngineCore_DP0 pid=89431)     self.quant_method: FusedMoEMethodBase = _get_quant_method()
(EngineCore_DP0 pid=89431)                                             ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 612, in _get_quant_method
(EngineCore_DP0 pid=89431)     quant_method = self.quant_config.get_quant_method(self, prefix)
(EngineCore_DP0 pid=89431)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 222, in get_quant_method
(EngineCore_DP0 pid=89431)     return self.get_xpu_quant_method(layer, prefix)
(EngineCore_DP0 pid=89431)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 213, in get_xpu_quant_method
(EngineCore_DP0 pid=89431)     return XPUFp8MoEMethod(fp8_config, layer)
(EngineCore_DP0 pid=89431)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/ipex_quant.py", line 438, in __init__
(EngineCore_DP0 pid=89431)     super().__init__(quant_config, layer)
(EngineCore_DP0 pid=89431)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 1097, in __init__
(EngineCore_DP0 pid=89431)     assert not quant_config.is_checkpoint_fp8_serialized
(EngineCore_DP0 pid=89431)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) AssertionError
(APIServer pid=89300) Traceback (most recent call last):
(APIServer pid=89300)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=89300)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1390, in <module>
(APIServer pid=89300)     uvloop.run(run_server(args))
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=89300)     return __asyncio.run(
(APIServer pid=89300)            ^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=89300)     return runner.run(main)
(APIServer pid=89300)            ^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=89300)     return self._loop.run_until_complete(task)
(APIServer pid=89300)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=89300)     return await main
(APIServer pid=89300)            ^^^^^^^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=89300)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=89300)     async with build_async_engine_client(
(APIServer pid=89300)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=89300)     return await anext(self.gen)
(APIServer pid=89300)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=89300)     async with build_async_engine_client_from_engine_args(
(APIServer pid=89300)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=89300)     return await anext(self.gen)
(APIServer pid=89300)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=89300)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=89300)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=89300)     return cls(
(APIServer pid=89300)            ^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=89300)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=89300)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=89300)     return AsyncMPClient(*client_args)
(APIServer pid=89300)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=89300)     super().__init__(
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=89300)     with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=89300)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=89300)     next(self.gen)
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=89300)     wait_for_engine_startup(
(APIServer pid=89300)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=89300)     raise RuntimeError(
(APIServer pid=89300) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
root@intel-NucBox-EVO-T2S:~# 
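
For triage: the failing frame is `assert not quant_config.is_checkpoint_fp8_serialized` in `fp8.py`, reached through `XPUFp8MoEMethod`. That suggests the XPU FP8 MoE path in this image rejects checkpoints whose weights are already stored in FP8. A minimal sketch to confirm what the checkpoint declares, assuming the model directory contains a Hugging Face style `config.json` with a `quantization_config.quant_method` field. The helper name `describe_quantization` and the substring check are illustrative assumptions, not vLLM's API:

```python
import json
from pathlib import Path


def describe_quantization(model_dir: str) -> dict:
    """Report the quantization method declared in <model_dir>/config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Assumption: the field sits at the top level, or under "text_config"
    # for multimodal checkpoints.
    quant = (
        config.get("quantization_config")
        or (config.get("text_config") or {}).get("quantization_config")
        or {}
    )
    method = str(quant.get("quant_method") or "").lower()
    return {
        "quant_method": method or None,
        # Assumption: a method name containing "fp8" means FP8-serialized weights.
        "fp8_serialized": "fp8" in method,
    }


# Example (path from this report):
# describe_quantization("/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8")
```

If this reports `fp8_serialized: True`, the assertion would be consistent with the XPU backend expecting an unquantized checkpoint for FP8 MoE, but that is a reading of the traceback rather than something confirmed for this image.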
