Refresh vllm_for_multi_arc.patch from qwen3_next_fix_0324 #387
Open
gc-fu wants to merge 1 commit into
Regenerated against upstream vllm v0.14.0 from intel-sandbox/llm-scaler-vllm-xpu branch qwen3_next_fix_0324 (1623c86). Verified `git apply --check` against upstream v0.14.0.

Main additions since the previous snapshot:
- Qwen3.5 / Qwen3-Next optimizations (BSZ>1 decode paths, moe_forward_full(_v2), ESIMD RMSNormGated / fused_add_rms_norm, cached decode tensor views)
- INT4 ESIMD kernel integrations: MoE, GEMV, norm_gemv / resadd_norm_gemv, including transpose=False weight-layout support
- GDN fixes: BSZ>1 precomputed-proj rollback, v=48 fix, buffer init for Qwen3_5DecoderLayer when BSZ>1
- Paged attention: GQA ratio pad for non-4-divisible ratios; gate page_attn_decode on gqaRatio % 4 == 0 to prevent kernel assertion
- Miscellaneous: skip sym_int4 quantization for shared_expert, cos_sin_cache + fp16 cache conversion, unified kernel package imports

Patch: 17510 lines / 712K, +1087 / -340 vs previous.
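For reviewers unfamiliar with the paged-attention item, the arithmetic behind the `gqaRatio % 4 == 0` gate can be sketched as follows. This is only an illustration: the helper names (`gqa_ratio`, `can_use_page_attn_decode`, `padded_gqa_ratio`) are hypothetical and do not come from the patch, and the description does not say how the patch combines the padding with the gate.

```python
def gqa_ratio(num_q_heads: int, num_kv_heads: int) -> int:
    """Query heads per KV head (hypothetical helper, not from the patch)."""
    if num_kv_heads <= 0 or num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a positive multiple of KV heads")
    return num_q_heads // num_kv_heads


def can_use_page_attn_decode(ratio: int, multiple: int = 4) -> bool:
    """Gate: only take the decode kernel path when the ratio is divisible by 4."""
    return ratio % multiple == 0


def padded_gqa_ratio(ratio: int, multiple: int = 4) -> int:
    """Round the ratio up to the next multiple of 4 (ceiling division)."""
    return -(-ratio // multiple) * multiple


if __name__ == "__main__":
    # Assumed example head counts, chosen only to exercise both branches.
    for q_heads, kv_heads in [(32, 8), (24, 4)]:
        r = gqa_ratio(q_heads, kv_heads)
        print(q_heads, kv_heads, r, can_use_page_attn_decode(r), padded_gqa_ratio(r))
```

With 24 query heads and 4 KV heads the ratio is 6, so the gate is false and the padded ratio would be 8; with 32 and 8 the ratio is already 4 and nothing changes.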
Summary
- Regenerated `vllm/patches/vllm_for_multi_arc.patch` from `intel-sandbox/llm-scaler-vllm-xpu` branch `qwen3_next_fix_0324` (commit `1623c86`) against upstream vllm `v0.14.0`.
- Verified `git apply --check` against a fresh upstream `v0.14.0` checkout.

What's new in this refresh
- Qwen3.5 / Qwen3-Next optimizations: BSZ>1 decode paths, `moe_forward_full(_v2)` single-dispatch, ESIMD `RMSNormGated` and batched `fused_add_rms_norm`, cached decode tensor views, `_prepare_inputs_decode_fast` for BSZ=1.
- INT4 ESIMD kernel integrations: MoE (including `transpose=False` weight layout), GEMV for decode path, `norm_gemv` / `resadd_norm_gemv`.
- GDN fixes: BSZ>1 precomputed-proj rollback in `gdn()`, v=48 fix, missing BSZ>1 buffer/int4-flag init in `Qwen3_5DecoderLayer`.
- Paged attention: GQA ratio pad for non-4-divisible ratios; gate `page_attn_decode` on `gqaRatio % 4 == 0` to prevent kernel assertion.
- Miscellaneous: skip `sym_int4` quantization for `shared_expert` (accuracy), `cos_sin_cache` + fp16 cache conversion, unified kernel package imports for `moe_ops` / `eagle_ops`.

Test plan
- `git apply --check vllm_for_multi_arc.patch` against upstream `v0.14.0`: passes.
- `amr-registry.caas.intel.com/intelanalytics/llm-scaler-vllm:0.14.0-b8.3-0427` (34.5GB).

🤖 Generated with Claude Code
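To repeat the `git apply --check` step from a script, a minimal sketch could look like the one below. The function names and the paths in the usage note are assumptions for illustration; only the `git apply --check` command itself comes from this PR. It assumes `git` is on `PATH` and that `repo_dir` is a checkout of upstream `v0.14.0`.

```python
import subprocess
from pathlib import Path


def git_apply_check_cmd(patch_path) -> list:
    """Build the dry-run command; --check tests applicability without modifying files."""
    return ["git", "apply", "--check", str(patch_path)]


def patch_applies(repo_dir, patch_path) -> bool:
    """Return True if the patch would apply cleanly inside repo_dir."""
    # Resolve the patch path first so it stays valid after switching cwd.
    cmd = git_apply_check_cmd(Path(patch_path).resolve())
    result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
    if result.returncode != 0:
        # git reports the failing files or hunks on stderr.
        print(result.stderr.strip())
    return result.returncode == 0
```

Usage, with placeholder paths: `patch_applies("/path/to/vllm", "vllm/patches/vllm_for_multi_arc.patch")`. A non-zero exit code from `git apply --check` means at least one hunk would not apply.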