(EngineCore_DP0 pid=629674) ERROR 05-11 06:53:08 [core.py:938] RuntimeError: gdn_conv_fused_seq: HV (32) exceeds max v-thread slots even with WG_SIZE=64 (0). H=16
Serving script:
python3 -m vllm.entrypoints.openai.api_server \
--model "$MODEL_PATH" \
--served-model-name "$MODEL_NAME" \
--dtype=float16 \
--enforce-eager \
--port 8001 \
--host 0.0.0.0 \
--trust-remote-code \
--gpu-memory-util=0.9 \
--enable-prefix-caching \
--max-num-batched-tokens 16384 \
--disable-log-requests \
--max-model-len 66560 \
--block-size 64 \
--quantization sym_int4 \
-tp=1