Docker image intel/llm-scaler-vllm:0.14.0-b8.2.1 does not support Qwen3.5-35B-A3B-GPTQ-Int4 and Qwen3.5-35B-A3B-FP8 #415

@shawn9977

Docker Image: intel/llm-scaler-vllm:0.14.0-b8.2.1

docker run -t -d --rm \
  --shm-size 32g \
  --net=host \
  --ipc=host \
  --privileged \
  -e no_proxy="localhost,127.0.0.1" \
  -e NO_PROXY="localhost,127.0.0.1" \
  --cap-add=SYS_PTRACE \
  --cap-add=SYS_ADMIN \
  --security-opt seccomp=unconfined \
  -v /dev/dri/by-path:/dev/dri/by-path \
  --name=vllm-test \
  --device /dev/dri:/dev/dri \
  -v /home/intel/vllm/models:/workspace/vllm/models \
  --entrypoint= \
  intel/llm-scaler-vllm:0.14.0-b8.2.1 \
  /bin/bash

root@intel-NucBox-EVO-T2S:~# VLLM_WORKER_MULTIPROC_METHOD=spawn python -m vllm.entrypoints.openai.api_server \
  --model /workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8 \
  --max-model-len 1200 \
  --gpu-memory-utilization 0.8 \
  --enforce-eager \
  --block-size 64 \
  --trust-remote-code \
  --max-num-batched-tokens 4096 \
  --max-num-seqs 2
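
With these flags, the log below warns that max_num_batched_tokens (4096) exceeds max_num_seqs * max_model_len (2400). The arithmetic behind that warning can be sketched as follows. `token_budget_report` is a hypothetical helper written for this note, not a vLLM function, and the warning does not appear to be related to the crash itself.

```python
def token_budget_report(max_num_batched_tokens: int,
                        max_num_seqs: int,
                        max_model_len: int) -> dict:
    """Compare the per-step token budget with the most tokens the
    allowed sequences could hold (max_num_seqs * max_model_len)."""
    ceiling = max_num_seqs * max_model_len
    return {
        "ceiling": ceiling,
        "exceeds": max_num_batched_tokens > ceiling,
    }


# Values taken from the command above.
report = token_budget_report(
    max_num_batched_tokens=4096, max_num_seqs=2, max_model_len=1200
)
print(report)  # {'ceiling': 2400, 'exceeds': True}
```

Under this reading, a --max-num-batched-tokens value of 2400 or less would stay within the ceiling for these settings.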

Qwen3.5-35B-A3B-FP8 Error

[W515 11:06:55.635458156 OperatorEntry.cpp:208] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:36
new kernel: registered at /root/workspace/frameworks.ai.pytorch.ipex-gpu/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
(APIServer pid=89300) INFO 05-15 11:06:58 [api_server.py:1272] vLLM API server version 0.14.1.dev0+gb17039bcc.d20260430
(APIServer pid=89300) INFO 05-15 11:06:58 [utils.py:263] non-default args: {'model': '/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8', 'trust_remote_code': True, 'max_model_len': 1200, 'enforce_eager': True, 'block_size': 64, 'gpu_memory_utilization': 0.8, 'max_num_batched_tokens': 4096, 'max_num_seqs': 2}
(APIServer pid=89300) [transformers] The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=89300) [transformers] The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=89300) INFO 05-15 11:06:58 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=89300) INFO 05-15 11:06:58 [model.py:1549] Using max model len 1200
(APIServer pid=89300) INFO 05-15 11:06:58 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=89300) WARNING 05-15 11:06:58 [_logger.py:68] max_num_batched_tokens (4096) exceeds max_num_seqs * max_model_len (2400). This may lead to unexpected behavior.
(APIServer pid=89300) [transformers] Qwen2VLImageProcessorFast is deprecated. The Fast suffix for image processors has been removed; use Qwen2VLImageProcessor instead.
(APIServer pid=89300) INFO 05-15 11:06:58 [config.py:479] Setting attention block size to 576 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=89300) INFO 05-15 11:06:58 [config.py:503] Padding mamba page size by 7.46% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=89300) INFO 05-15 11:06:58 [vllm.py:636] Asynchronous scheduling is disabled.
(APIServer pid=89300) WARNING 05-15 11:06:58 [logger.py:68] Enforce eager set, overriding optimization level to -O0
[W515 11:07:01.159592070 OperatorEntry.cpp:208] Warning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: aten::geometric_(Tensor(a!) self, float p, *, Generator? generator=None) -> Tensor(a!)
registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at /pytorch/aten/src/ATen/VmapModeRegistrations.cpp:36
new kernel: registered at /root/workspace/frameworks.ai.pytorch.ipex-gpu/build/Release/csrc/gpu/csrc/gpu/xpu/ATen/RegisterXPU_0.cpp:172 (function operator())
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:04 [core.py:97] Initializing a V1 LLM engine (v0.14.1.dev0+gb17039bcc.d20260430) with config: model='/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8', speculative_config=None, tokenizer='/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1200, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=fp8, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 
'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=89431) [transformers] Qwen2VLImageProcessorFast is deprecated. The Fast suffix for image processors has been removed; use Qwen2VLImageProcessor instead.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:04 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.67.108.172:43463 backend=xccl
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:04 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
2026:05:15-11:07:04:(89431) |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:05:15-11:07:04:(89431) |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:05:15-11:07:05:(89431) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance
2026:05:15-11:07:05:(89431) |CCL_WARN| Applying sycl-kernels, but device family is not recognized
(EngineCore_DP0 pid=89431) [transformers] The use_fast parameter is deprecated and will be removed in a future version. Use backend="torchvision" instead of use_fast=True, or backend="pil" instead of use_fast=False.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:07 [gpu_model_runner.py:3811] Starting to load model /workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8...
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [xpu.py:106] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [xpu.py:103] Using backend AttentionBackendEnum.IPEX for vit attention
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [mm_encoder_attention.py:89] Using AttentionBackendEnum.IPEX for MMEncoderAttention.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [fp8.py:126] DeepGEMM is disabled because the platform does not support it.
(EngineCore_DP0 pid=89431) INFO 05-15 11:07:08 [fp8.py:149] Using Triton backend for FP8 MoE
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] EngineCore failed to start.
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] Traceback (most recent call last):
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] super().__init__(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self._init_executor()
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.driver_worker.load_model()
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3830, in load_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.model = model_loader.load_model(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] model = initialize_model(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 50, in initialize_model
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 1143, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.language_model = Qwen3_5MoeForCausalLM(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 890, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] super().__init__(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 832, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.model = Qwen3_5Model(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] old_init(self, **kwargs)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 592, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 586, in get_layer
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] return Qwen3_5DecoderLayer(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 511, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.mlp = Qwen3NextSparseMoeBlock(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 170, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.experts = SharedFusedMoE(
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/shared_fused_moe.py", line 28, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] super().__init__(**kwargs)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 620, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] self.quant_method: FusedMoEMethodBase = _get_quant_method()
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 612, in _get_quant_method
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] quant_method = self.quant_config.get_quant_method(self, prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 222, in get_quant_method
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] return self.get_xpu_quant_method(layer, prefix)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 213, in get_xpu_quant_method
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] return XPUFp8MoEMethod(fp8_config, layer)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/ipex_quant.py", line 438, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] super().__init__(quant_config, layer)
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 1097, in __init__
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] assert not quant_config.is_checkpoint_fp8_serialized
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) ERROR 05-15 11:07:08 [core.py:936] AssertionError
(EngineCore_DP0 pid=89431) Process EngineCore_DP0:
(EngineCore_DP0 pid=89431) Traceback (most recent call last):
(EngineCore_DP0 pid=89431) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=89431) self.run()
(EngineCore_DP0 pid=89431) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=89431) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
(EngineCore_DP0 pid=89431) raise e
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
(EngineCore_DP0 pid=89431) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
(EngineCore_DP0 pid=89431) super().__init__(
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
(EngineCore_DP0 pid=89431) self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 101, in __init__
(EngineCore_DP0 pid=89431) self._init_executor()
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=89431) self.driver_worker.load_model()
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 274, in load_model
(EngineCore_DP0 pid=89431) self.model_runner.load_model(eep_scale_up=eep_scale_up)
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3830, in load_model
(EngineCore_DP0 pid=89431) self.model = model_loader.load_model(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
(EngineCore_DP0 pid=89431) model = initialize_model(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 50, in initialize_model
(EngineCore_DP0 pid=89431) return model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 1143, in __init__
(EngineCore_DP0 pid=89431) self.language_model = Qwen3_5MoeForCausalLM(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 890, in __init__
(EngineCore_DP0 pid=89431) super().__init__(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 832, in __init__
(EngineCore_DP0 pid=89431) self.model = Qwen3_5Model(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 305, in __init__
(EngineCore_DP0 pid=89431) old_init(self, **kwargs)
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 592, in __init__
(EngineCore_DP0 pid=89431) self.start_layer, self.end_layer, self.layers = make_layers(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 606, in make_layers
(EngineCore_DP0 pid=89431) maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 586, in get_layer
(EngineCore_DP0 pid=89431) return Qwen3_5DecoderLayer(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 511, in __init__
(EngineCore_DP0 pid=89431) self.mlp = Qwen3NextSparseMoeBlock(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 170, in __init__
(EngineCore_DP0 pid=89431) self.experts = SharedFusedMoE(
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/shared_fused_moe.py", line 28, in __init__
(EngineCore_DP0 pid=89431) super().__init__(**kwargs)
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 620, in __init__
(EngineCore_DP0 pid=89431) self.quant_method: FusedMoEMethodBase = _get_quant_method()
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 612, in _get_quant_method
(EngineCore_DP0 pid=89431) quant_method = self.quant_config.get_quant_method(self, prefix)
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 222, in get_quant_method
(EngineCore_DP0 pid=89431) return self.get_xpu_quant_method(layer, prefix)
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 213, in get_xpu_quant_method
(EngineCore_DP0 pid=89431) return XPUFp8MoEMethod(fp8_config, layer)
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/ipex_quant.py", line 438, in __init__
(EngineCore_DP0 pid=89431) super().__init__(quant_config, layer)
(EngineCore_DP0 pid=89431) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 1097, in __init__
(EngineCore_DP0 pid=89431) assert not quant_config.is_checkpoint_fp8_serialized
(EngineCore_DP0 pid=89431) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=89431) AssertionError
(APIServer pid=89300) Traceback (most recent call last):
(APIServer pid=89300) File "", line 198, in _run_module_as_main
(APIServer pid=89300) File "", line 88, in _run_code
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1390, in
(APIServer pid=89300) uvloop.run(run_server(args))
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=89300) return __asyncio.run(
(APIServer pid=89300) ^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=89300) return runner.run(main)
(APIServer pid=89300) ^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=89300) return self._loop.run_until_complete(task)
(APIServer pid=89300) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=89300) return await main
(APIServer pid=89300) ^^^^^^^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1319, in run_server
(APIServer pid=89300) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1338, in run_server_worker
(APIServer pid=89300) async with build_async_engine_client(
(APIServer pid=89300) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=89300) return await anext(self.gen)
(APIServer pid=89300) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 173, in build_async_engine_client
(APIServer pid=89300) async with build_async_engine_client_from_engine_args(
(APIServer pid=89300) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=89300) return await anext(self.gen)
(APIServer pid=89300) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 214, in build_async_engine_client_from_engine_args
(APIServer pid=89300) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=89300) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 205, in from_vllm_config
(APIServer pid=89300) return cls(
(APIServer pid=89300) ^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 132, in __init__
(APIServer pid=89300) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=89300) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 122, in make_async_mp_client
(APIServer pid=89300) return AsyncMPClient(*client_args)
(APIServer pid=89300) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 824, in __init__
(APIServer pid=89300) super().__init__(
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 479, in __init__
(APIServer pid=89300) with launch_core_engines(vllm_config, executor_class, log_stats) as (
(APIServer pid=89300) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=89300) next(self.gen)
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
(APIServer pid=89300) wait_for_engine_startup(
(APIServer pid=89300) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
(APIServer pid=89300) raise RuntimeError(
(APIServer pid=89300) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
root@intel-NucBox-EVO-T2S:~#
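
The EngineCore traceback ends at `assert not quant_config.is_checkpoint_fp8_serialized`, reached through the XPU FP8 MoE method, which suggests this code path only expects checkpoints that are not already serialized in FP8. A minimal sketch for checking how a local checkpoint declares its quantization follows. It assumes the usual Hugging Face layout in which `config.json` carries a `quantization_config` entry, either at the top level or under `text_config`; both locations are assumptions here, and `read_quant_config` is a hypothetical helper, not part of vLLM.

```python
import json
from pathlib import Path
from typing import Optional


def read_quant_config(model_dir: str) -> Optional[dict]:
    """Return the quantization_config dict from config.json, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Assumption: the entry sits at the top level or under text_config.
    for section in (config, config.get("text_config") or {}):
        quant = section.get("quantization_config")
        if quant is not None:
            return quant
    return None


if __name__ == "__main__":
    # Path taken from the --model flag in the report.
    model_dir = "/workspace/vllm/models/Qwen/Qwen3.5-35B-A3B-FP8"
    if (Path(model_dir) / "config.json").exists():
        print(read_quant_config(model_dir))
```

If the result names an FP8 quantization method, the checkpoint is pre-quantized, which would be consistent with `quantization=fp8` in the engine config line of the log.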
