Refresh vllm_for_multi_arc.patch from qwen3_next_fix_0324 by gc-fu · Pull Request #387 · intel/llm-scaler

gc-fu · 2026-04-27T11:11:48Z

Summary

Regenerate vllm/patches/vllm_for_multi_arc.patch from intel-sandbox/llm-scaler-vllm-xpu branch qwen3_next_fix_0324 (commit 1623c86) against upstream vllm v0.14.0.
Patch diff vs previous: +1087 / -340 lines (total 17,510 lines / 712K).
Verified with git apply --check against a fresh upstream v0.14.0 checkout.
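
For readers who want to reproduce a "+N / -M" figure like the one above, here is a minimal sketch of counting added and removed lines in a unified diff. It is not the tool used for this PR (the description does not say how the numbers were produced), and the function name `diffstat` is hypothetical. It tracks hunk line counts from the `@@` headers instead of just matching prefixes, because a diff between two patch files contains body lines that themselves start with `---` or `+++`.

```python
import re

# Matches a unified-diff hunk header such as "@@ -12,7 +12,9 @@".
# A missing ",count" means a count of 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def diffstat(diff_text: str) -> tuple[int, int]:
    """Return (added, removed) line counts for a unified diff (hypothetical helper)."""
    added = removed = 0
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left == 0 and new_left == 0:
            # Outside a hunk: only hunk headers matter; file headers are skipped.
            m = HUNK_RE.match(line)
            if m:
                old_left = int(m.group(1)) if m.group(1) is not None else 1
                new_left = int(m.group(2)) if m.group(2) is not None else 1
            continue
        if line.startswith("+"):
            added += 1
            new_left -= 1
        elif line.startswith("-"):
            removed += 1
            old_left -= 1
        elif line.startswith("\\"):
            continue  # "\ No newline at end of file" marker
        else:
            # Context line: present on both sides.
            old_left -= 1
            new_left -= 1
    return added, removed
```

Usage would look like `diffstat(open("patch.diff").read())`, where `patch.diff` is a placeholder for the saved output of a diff between the old and new patch files.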

What's new in this refresh

Qwen3.5 / Qwen3-Next: BSZ>1 decode paths, moe_forward_full(_v2) single-dispatch, ESIMD RMSNormGated and batched fused_add_rms_norm, cached decode tensor views, _prepare_inputs_decode_fast for BSZ=1.
INT4 ESIMD kernels: MoE integration for Qwen3.5 MoE models (incl. transpose=False weight layout), GEMV for decode path, norm_gemv / resadd_norm_gemv.
GDN: roll back the BSZ>1 precomputed-proj path to standard gdn(), v=48 fix, add the missing BSZ>1 buffer/int4-flag init in Qwen3_5DecoderLayer.
Paged attention: GQA ratio pad support for non-4-divisible ratios; gate page_attn_decode on gqaRatio % 4 == 0 to prevent kernel assertion.
Misc: skip sym_int4 quantization for shared_expert (accuracy), cos_sin_cache + fp16 cache conversion, unified kernel package imports for moe_ops / eagle_ops.
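
The paged-attention item can be illustrated with a small sketch of the arithmetic involved. This is not code from the patch: the helper names (`gqa_ratio`, `can_use_page_attn_decode`, `padded_gqa_ratio`) are hypothetical, and it assumes the GQA ratio is the number of query heads per KV head and that padding rounds the ratio up to the next multiple of 4.

```python
def gqa_ratio(num_q_heads: int, num_kv_heads: int) -> int:
    """Query heads per KV head (assumed definition of the GQA ratio)."""
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    return num_q_heads // num_kv_heads


def can_use_page_attn_decode(ratio: int, multiple: int = 4) -> bool:
    """Gate: take the specialized decode path only when ratio % multiple == 0."""
    return ratio % multiple == 0


def padded_gqa_ratio(ratio: int, multiple: int = 4) -> int:
    """Round the ratio up to the next multiple (ceiling division, then scale)."""
    return -(-ratio // multiple) * multiple


# Example: 24 query heads over 4 KV heads gives a ratio of 6, which is not
# divisible by 4, so the gate rejects it and the padded ratio is 8.
ratio = gqa_ratio(24, 4)
use_fast_path = can_use_page_attn_decode(ratio)
padded = padded_gqa_ratio(ratio)
```

Checking the condition on the host side before dispatch is one way to fall back to another path instead of hitting an assertion inside the kernel, which matches the intent stated in the item above.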

Test plan

git apply --check vllm_for_multi_arc.patch against upstream v0.14.0 — passes.
Docker image built successfully: amr-registry.caas.intel.com/intelanalytics/llm-scaler-vllm:0.14.0-b8.3-0427 (34.5GB).
Smoke / accuracy runs on Qwen3.5 models (to follow).

🤖 Generated with Claude Code

Regenerated against upstream vllm v0.14.0 from intel-sandbox/llm-scaler-vllm-xpu branch qwen3_next_fix_0324 (1623c86). Verified `git apply --check` against upstream v0.14.0.

Main additions since the previous snapshot:
- Qwen3.5 / Qwen3-Next optimizations (BSZ>1 decode paths, moe_forward_full(_v2), ESIMD RMSNormGated / fused_add_rms_norm, cached decode tensor views)
- INT4 ESIMD kernel integrations: MoE, GEMV, norm_gemv / resadd_norm_gemv, including transpose=False weight-layout support
- GDN fixes: BSZ>1 precomputed-proj rollback, v=48 fix, buffer init for Qwen3_5DecoderLayer when BSZ>1
- Paged attention: GQA ratio pad for non-4-divisible ratios; gate page_attn_decode on gqaRatio % 4 == 0 to prevent kernel assertion
- Miscellaneous: skip sym_int4 quantization for shared_expert, cos_sin_cache + fp16 cache conversion, unified kernel package imports

Patch: 17510 lines / 712K, +1087 / -340 vs previous.

gc-fu wants to merge 1 commit into main from refresh-patch-qwen3_next_fix_0324
