Starting vllm service failed with qwen3.6 35b a3b sym_int4 on B70 #407

@HoppeDeng

(EngineCore_DP0 pid=629674) ERROR 05-11 06:53:08 [core.py:938] RuntimeError: gdn_conv_fused_seq: HV (32) exceeds max v-thread slots even with WG_SIZE=64 (0). H=16

Serving script:
python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_PATH" \
    --served-model-name "$MODEL_NAME" \
    --dtype=float16 \
    --enforce-eager \
    --port 8001 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --gpu-memory-util=0.9 \
    --enable-prefix-caching \
    --max-num-batched-tokens 16384 \
    --disable-log-requests \
    --max-model-len 66560 \
    --block-size 64 \
    --quantization sym_int4 \
    -tp=1
