Refresh vllm_for_multi_arc.patch from qwen3_next_fix_0324 #387

Open
gc-fu wants to merge 1 commit into main from refresh-patch-qwen3_next_fix_0324

Conversation

gc-fu (Contributor) commented Apr 27, 2026

Summary

  • Regenerate vllm/patches/vllm_for_multi_arc.patch from intel-sandbox/llm-scaler-vllm-xpu branch qwen3_next_fix_0324 (commit 1623c86) against upstream vllm v0.14.0.
  • Patch diff vs previous: +1087 / -340 lines (total 17,510 lines / 712K).
  • Verified with git apply --check against a fresh upstream v0.14.0 checkout.
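The `git apply --check` verification above can be scripted. The following is a minimal sketch, not the procedure actually used for this PR: the helper names `build_check_command` and `patch_applies_cleanly` are hypothetical, and it assumes `git` is on `PATH` and that `repo_dir` is a clean checkout of upstream vllm at v0.14.0.

```python
import subprocess
from pathlib import Path


def build_check_command(patch_path):
    """Return the git command that dry-runs a patch without modifying the tree."""
    # `git apply --check` only reports whether the patch would apply.
    return ["git", "apply", "--check", str(patch_path)]


def patch_applies_cleanly(repo_dir, patch_path):
    """Return True if the patch would apply to the checkout at repo_dir.

    Both arguments are illustrative; repo_dir is assumed to be a fresh
    checkout of the upstream tag the patch was generated against.
    """
    # Resolve the patch path first, because the command runs with cwd=repo_dir.
    command = build_check_command(Path(patch_path).resolve())
    result = subprocess.run(command, cwd=repo_dir, capture_output=True, text=True)
    if result.returncode != 0:
        # git prints the failing hunks and files to stderr.
        print(result.stderr.strip())
    return result.returncode == 0
```

A call such as `patch_applies_cleanly("vllm", "vllm/patches/vllm_for_multi_arc.patch")` would then return `True` only when git exits with status 0; the checkout directory name here is an assumption.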

What's new in this refresh

  • Qwen3.5 / Qwen3-Next: BSZ>1 decode paths, moe_forward_full(_v2) single-dispatch, ESIMD RMSNormGated and batched fused_add_rms_norm, cached decode tensor views, _prepare_inputs_decode_fast for BSZ=1.
  • INT4 ESIMD kernels: MoE integration for Qwen3.5 MoE models (incl. transpose=False weight layout), GEMV for decode path, norm_gemv / resadd_norm_gemv.
  • GDN: roll back the BSZ>1 precomputed-proj path to standard gdn(), v=48 fix, add the missing BSZ>1 buffer/int4-flag init in Qwen3_5DecoderLayer.
  • Paged attention: GQA ratio pad support for non-4-divisible ratios; gate page_attn_decode on gqaRatio % 4 == 0 to prevent kernel assertion.
  • Misc: skip sym_int4 quantization for shared_expert (accuracy), cos_sin_cache + fp16 cache conversion, unified kernel package imports for moe_ops / eagle_ops.
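The paged-attention item above involves two pieces of arithmetic: a gate on `gqaRatio % 4 == 0` and padding a ratio up to a multiple of 4. The sketch below illustrates only that arithmetic. The function names are hypothetical and not taken from the patch, the ratio is assumed to be query heads divided by KV heads, and how the patch combines the gate with the padding path is not stated here.

```python
def gqa_ratio(num_q_heads: int, num_kv_heads: int) -> int:
    """Number of query heads that share one KV head (assumed definition)."""
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    return num_q_heads // num_kv_heads


def can_use_page_attn_decode(num_q_heads: int, num_kv_heads: int) -> bool:
    """Hypothetical gate: take the kernel path only when the ratio is divisible by 4."""
    return gqa_ratio(num_q_heads, num_kv_heads) % 4 == 0


def padded_gqa_ratio(ratio: int, multiple: int = 4) -> int:
    """Round a GQA ratio up to the next multiple (illustrative padding rule)."""
    return ((ratio + multiple - 1) // multiple) * multiple
```

For example, 32 query heads over 8 KV heads give a ratio of 4 and pass the gate, while 24 query heads over 4 KV heads give a ratio of 6, which fails the gate and would pad to 8 under this rule.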

Test plan

  • git apply --check vllm_for_multi_arc.patch against upstream v0.14.0 — passes.
  • Docker image built successfully: amr-registry.caas.intel.com/intelanalytics/llm-scaler-vllm:0.14.0-b8.3-0427 (34.5GB).
  • Smoke / accuracy runs on Qwen3.5 models (to follow).

🤖 Generated with Claude Code

Regenerated against upstream vllm v0.14.0 from intel-sandbox/llm-scaler-vllm-xpu
branch qwen3_next_fix_0324 (1623c86). Verified `git apply --check` against
upstream v0.14.0.

Main additions since the previous snapshot:
- Qwen3.5 / Qwen3-Next optimizations (BSZ>1 decode paths, moe_forward_full(_v2),
  ESIMD RMSNormGated / fused_add_rms_norm, cached decode tensor views)
- INT4 ESIMD kernel integrations: MoE, GEMV, norm_gemv / resadd_norm_gemv,
  including transpose=False weight-layout support
- GDN fixes: BSZ>1 precomputed-proj rollback, v=48 fix, buffer init for
  Qwen3_5DecoderLayer when BSZ>1
- Paged attention: GQA ratio pad for non-4-divisible ratios; gate
  page_attn_decode on gqaRatio % 4 == 0 to prevent kernel assertion
- Miscellaneous: skip sym_int4 quantization for shared_expert, cos_sin_cache
  + fp16 cache conversion, unified kernel package imports

Patch: 17510 lines / 712K, +1087 / -340 vs previous.