docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334
Open
MegaStood wants to merge 157 commits into
Open
docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334MegaStood wants to merge 157 commits into
MegaStood wants to merge 157 commits into
Conversation
Documents incompatibility — this project targets discrete Arc Pro B60 GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
SYCL ESIMD kernels are fully portable to Xe2 (no hardcoded device IDs or BMG-specific constants). This adds the infrastructure adaptations: - Dockerfile.lunar-lake: lightweight single-GPU image for iGPU - lunar_lake_serve.sh: memory-aware vLLM launch with shared memory config - lunar_lake_evaluation.sh: iGPU platform evaluation (skip P2P/CCL) - platform_basic_evaluation.sh: detect Lunar Lake iGPU alongside B60 - LUNAR_LAKE_COMPATIBILITY.md: full docs with quick start and model recs Key settings for Lunar Lake: TP=1, CCL_TOPO_P2P_ACCESS=0, gpu-memory-utilization=0.7, enforce-eager, INT4 quantization recommended. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
One-command installer that handles: - oneAPI Base Toolkit via DNF - Level-Zero for Xe2 iGPU - PyTorch XPU in a venv - Patched vLLM build from source - vllm-xpu-kernels + triton-xpu - Bash aliases for quick launch (vllm-serve) No Docker required. Targets Nobara 43 / Fedora 42+. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Level-Zero packages aren't in Nobara/Fedora default repos. Now: - Adds Intel oneAPI + compute-runtime repos first - Tries multiple Level-Zero package names (varies across distros) - Uses --skip-unavailable to handle missing packages gracefully - Falls back to checking if libze_loader exists from xe driver https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Two issues on Nobara 43: 1. oneAPI setvars.sh hangs when MPI probes network interfaces over SSH. Fix: set ONEAPI_SETVARS_MPI_INSTALL=0 before sourcing. 2. Nobara 43 ships Python 3.14 but PyTorch XPU only has wheels for <=3.12. Fix: detect version, auto-install python3.12 via dnf if needed. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
setvars.sh has unbound variables and non-zero exits internally that trigger our set -euo pipefail, silently killing the script at Phase 2. Fix: temporarily set +euo pipefail around the source call, pipe output through grep to show only component init lines, then restore strict mode. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The pipe operator causes `source` to run in a subshell, so all oneAPI environment variables were lost in the parent shell. Redirect to a temp file instead and grep it separately. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The +xpu version suffix doesn't exist on PyPI. XPU wheels are served from a separate index URL. Also remove tail pipe so download progress is visible. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Use exact version pins from official Intel XPU install docs to avoid pulling incompatible versions. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
PyTorch XPU needs libze_intel_gpu.so (the GPU userspace driver) to talk to the xe kernel driver via Level-Zero. This is provided by the intel-compute-runtime package, which was missing from the install list. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
All the | tail -N pipes were hiding pip install and build output, making the script appear frozen during long operations. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Running from ~/llm-scaler which has a vllm/ subdirectory causes 'import vllm' to succeed even when vLLM is not installed, skipping the entire build phase. Use 'pip show vllm' instead. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Show full build output so errors are visible. Also clean up stale vllm-xpu-kernels directory if previous build failed. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Preload oneAPI MKL libraries (LD_PRELOAD) in both serve and install scripts to fix "Cannot load libmkl_core.so.2" caused by PyTorch's broken relative RPATH in venvs. Set MAX_JOBS=2 for xpu-kernels build to prevent OOM kills on 32GB shared-memory systems. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Change MAX_JOBS from 2 to 6 (75% CPU, ~24GB peak RAM, safe for 32GB) - Make MAX_JOBS overridable via environment variable - Add log message warning that 933 SYCL files take 1.5-2 hours on Lunar Lake - Add comment reminding users to plug in before building https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…unar Lake
Real-world testing on MSI Claw 8 AI+ (Arc 140V, 32GB shared) revealed:
- oneCCL KVS init fails on WiFi-only devices ("can't find non-loopback interface")
Added CCL_ZE_ENABLE=0, CCL_ATL_TRANSPORT=ofi, FI_PROVIDER=tcp workaround
- --device xpu is not a valid vLLM CLI flag; device must be set via
VLLM_TARGET_DEVICE=xpu environment variable
- xpu_worker.py all_reduce warmup must be patched out for single-GPU (TP=1)
- Pre-quantized AutoRound/GPTQ 35B models OOM during weight loading on 32GB
shared memory (peak memory doubles due to INT4→FP16 unpacking)
- vllm-xpu-kernels build takes 1.5-2 hours on Lunar Lake (933 SYCL files)
- GPU crash (DEVICE_LOST) after OOM requires full system reboot
Updated: install_lunar_lake.sh, lunar_lake_serve.sh, Dockerfile.lunar-lake,
LUNAR_LAKE_COMPATIBILITY.md
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…g results Add detailed model compatibility matrix based on real testing on MSI Claw 8 (Core Ultra 7 258V, Arc 140V). Key findings: - Triton XPU backend broken on Xe2 (blocks Qwen3.5 fla/linear attention) - Marlin kernels CUDA-only (blocks AWQ/GPTQ compressed-tensors) - Pre-quantized INT4 doubles peak memory (blocks >14B models on 32GB) - Only FP16 base + online quantization works reliably https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timeouts) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tested Intel/Qwen3-8B-int4-AutoRound on MSI Claw 8 (32GB LPDDR5x): - Single-request: 17.6 tok/s generation - Batched peak: 90 tok/s (5 concurrent short requests) - Long context (4K+2K): 50.2 tok/s output, 150.5 tok/s total - Model loads in 7s using only 5.69 GiB https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6 Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results
…y 5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documented working setup for Qwen3-TTS-12Hz-1.7B-Base on Arc 140V: - Venv creation with shared XPU PyTorch from vLLM install - transformers==4.57.3 pinning (newer versions break qwen-tts) - Voice cloning example with generate_voice_clone() API - Tested and confirmed working on MSI Claw 8 AI+ https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Add consolidated "Running Recipes" section with LLM/ASR/TTS commands and memory budget table - Add Qwen3-ASR-1.7B setup, serve, and test documentation - Use 127.0.0.1 (localhost) since OpenClaw accesses services locally - ASR uses 0.25 GPU utilization (~7.2GB) for efficient memory usage https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Qwen3-30B-A3B GPTQ INT4 routes to IPEX (bypasses CUDA Marlin) but OOMs during MoE expert weight shuffle at 15.7 GiB — GPU enters DEVICE_LOST state. Confirms 30B MoE models don't fit on 32GB shared memory. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- New install script install_meteor_arrow_lake.sh supporting: - Meteor Lake Xe-LPG (Core Ultra 155H/135H): PCI 7d55, 7dd5, 7d40, 7d45 - Arrow Lake Xe-LPG+ (Core Ultra 255H/245H): PCI 7d51, 7dd1, 7d41, 7d67 - Auto-detects platform and GPU via PCI device IDs - Warns Meteor Lake users to switch from i915 to xe driver - Adjusts memory recommendations based on system RAM (16-96GB) - Updated evaluation script detect_gpu() for all three platforms - Added platform compatibility table to LUNAR_LAKE_COMPATIBILITY.md https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… transformers 5.x catch-22 - Add server-side engine log observations: generation throughput decay with context length, prefix cache hit rates, KV cache usage per concurrent request - Add Qwen3.5-4B AutoRound failure: multimodal model hits transformers 5.x Qwen2VLImageProcessor.max_pixels API break - Add LFM2-24B-A2B AWQ failure: custom Liquid AI tokenizer unsupported - Document transformers version catch-22: 4.x can't recognize new architectures, 5.x breaks vLLM multimodal code - Add critical blocker: transformers 5.x vs vLLM mismatch for multimodal models https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…rc.patch Intel's llm-scaler Docker image uses transformers from git HEAD (5.x) with vllm_for_multi_arc.patch applied, which adds full Qwen3.5 architecture support (min_transformers_version="5.1.0"). The max_pixels AttributeError is caused by transformers 5.x renaming image_processor.max_pixels to size["longest_edge"]. Added one-line getattr() fix for native installs without the full patch. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Confirmed that Qwen3.5-4B AutoRound INT4 successfully loads (3.68 GiB — well within Lunar Lake's 28.6 GB budget) after applying the max_pixels getattr() fix, but crashes during warmup on the Triton kernel in fla/ops/layernorm_guard.py. Even Intel's forward_xpu code path for Qwen3.5 routes through fla/ops layernorm which requires @triton.jit. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
CUTLASS INT4 packs nibbles from different output columns in each byte, not sequential nibbles along the input dimension like GPTQ. Conversion needs full layout transform, not just int32→uint8 repack. Inference pipeline works (no crash) but output is garbled due to this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full patch collection for running INT4 GPTQ/AutoRound MoE models on Intel Lunar Lake Xe2 iGPU with vLLM 0.14 + vllm-xpu-kernels. New patches: - xpu_gptq_moe_cutlass_sequential.patch: CUTLASS sequential expert loop bypassing broken IPEX GatedMLPMOE (BREAKTHROUGH fix) - eagle_ops_dtype_check.patch: skip ESIMD decode for non-FP16 query - ipex_attention_dtype_cast.patch: cast k/v to match q dtype - xpu_attention_direct_call.patch: bypass torch.ops.vllm attention - xpu_moe_direct_call.patch: bypass torch.ops.vllm MoE dispatch - xpu_worker_skip_profile_and_warmup.patch: skip profile + warmup Updated patches: - ipex_marlin_shuffle_skip_preshuffled.patch: full IPEX __init__ bypass - mla_xpu_return_attn_probs.patch: regenerated from current source Removed (already in source tree): - autoround_fusedmoe_ipex_routing.patch (Bug A) - gemma4_moe_top_k_experts.patch (Bug B) - gptq_math_ceil_alignment.patch (Bug C) Status: Inference pipeline completes without GPU crash. Output is garbled due to GPTQ→CUTLASS INT4 weight format mismatch (column- interleaved vs row-packed nibbles). Weight conversion fix is next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ASS INT4 kernel The _gptq_int32_to_uint8 conversion was missing a K<->N transpose step. GPTQ stores qweight as [E, K/8, N] int32 (K-major), but CUTLASS grouped GEMM expects [E, N, K/2] uint8 (N-major). The old implementation only repacked nibbles along the existing K-major axis, producing [E, K/2, N] which caused DEVICE_LOST or garbage output with real model weights while appearing to work with zeros/random data. New implementation: unpack int32 -> [E, K, N] int8, transpose to [E, N, K], then repack nibble pairs (low = even-K, high = odd-K) -> [E, N, K/2] uint8. Layout now matches vllm-xpu-kernels docstring in fused_moe_interface.py. Documented symmetric-ZP constraint (assumes sym=True — AutoRound default). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Adds a second backend for INT4 MoE selectable via VLLM_XPU_MOE_BACKEND: - "cutlass" (default): existing chunked grouped GEMM path - "onednn": per-expert int4_gemm_w4a16 via torch.ops._xpu_C The oneDNN path uses GPTQ's native int32 [K/8, N] layout — no uint8 conversion, no shuffling, no implement_zp. Weights just need K-contiguous strides (via transpose trick) and a scalar zero-point of 8 for symmetric quantization. Slower than chunked CUTLASS (one kernel call per expert per gemm) but useful as a fallback when CUTLASS has issues on specific model shapes. Both paths still bypass the broken IPEX GatedMLPMOE (Bug H / IPEX #838). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… backend - Explains the missing K<->N transpose that caused DEVICE_LOST with real model weights (worked with zeros/random due to symmetric data). - Documents the new VLLM_XPU_MOE_BACKEND=onednn alternative path using int4_gemm_w4a16 with native GPTQ int32 layout as a fallback. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Correct weight conversion using int4_gemm_w4a16 (oneDNN format): - No marlin_shuffle, no implement_zp, no uint8 conversion - GPTQ int32 weights used directly with K-contiguous strides - Scalar zero_point=8 for symmetric GPTQ - 0.999997 correlation with CPU reference dequant Inference produces real Chinese text (not garbled), 50 tokens generated, HTTP 200 OK. Performance is 0.16 tok/s (sequential expert loop — optimization needed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Optimized apply(): pre-compute active experts via .unique(), use index_add_() for scatter-accumulate, fix dtype cast. 315s → 41s for 30 tokens. Generation: 0.16 → 0.9 tok/s. Kernel benchmarks show 30μs per int4_gemm_w4a16 call — the bottleneck is now Python loop overhead (88% of total time). Updated xpu_gptq_moe_int4_w4a16.patch with optimized apply(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirmed: GPT-OSS-20B runs via GatedMLPMOE(is_mxfp4=True) on IPEX marlin backend. Same GatedMLPMOE crashes with is_int4=True but works with is_mxfp4=True — the bug is in IPEX's INT4 kernel path. Benchmark: MXFP4 gets 3.0 tok/s (batched), INT4 gets 0.9 tok/s (sequential). ~6.5x faster per layer due to grouped GEMM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both INT4 and MXFP4 CUTLASS grouped GEMM crash with DEVICE_LOST at 16+ active expert groups. IPEX moe_gemm handles 32 experts fine. The bug is in CUTLASS's Level Zero dispatch, not data format. IPEX INT4 crashes for a separate reason (kernel bug in INT4 dequant). GPT-OSS-20B benchmark: 5.4 tok/s gen, 14.2 tok/s prompt via IPEX. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1 not real INT4 cutlass_grouped_gemm_interface(is_B_int4=True) treats nibbles as FP4 E2M1 microscaling format (2 exponent + 1 mantissa bit), NOT integer 4-bit. Value 15 maps to NaN, other values are exponentially wrong. The entire CUTLASS INT4 path (implement_zp, uint8 repack, chunked dispatch) produces garbage output with real GPTQ model weights. int4_gemm_w4a16 (oneDNN) is the correct kernel — reads native GPTQ int32 weights and does proper (nibble - zp) * scale dequantization. 30μs/call, 2x faster than CUTLASS, 40x faster than BF16 at M=1. Changes: - Default VLLM_XPU_MOE_BACKEND switched from "cutlass" to "onednn" - CUTLASS INT4 path marked as broken in code comments - Bug H docs rewritten with FP4 vs INT4 comparison table, correct kernel identification, and throughput analysis https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ke-compatibility-CB5w6 # Conflicts: # vllm/patches/vllm_for_multi_arc.patch
Bug H has two independent failures: 1. GatedMLPMOE crash (SOLVED) — topk_softmax + wrapper INT4 path, bypassed with torch.topk + direct IPEX ops 2. group_mm_int4_out_marlin returns zeros/NaN on Xe2 (UNSOLVED) — kernel broken, blocks batched path. 4-layout test proves no weight layout fix possible. Correct old "Level Zero context pollution" theory — crash was in GatedMLPMOE/topk_softmax, not attention corrupting driver state. Patch cleanup: - Delete xpu_gptq_moe_cpu_shuffle.patch (superseded by int4_w4a16) - Delete xpu_gptq_moe_cutlass_sequential.patch (CUTLASS is FP4 E2M1) - Delete xpu_skip_warmup_dummy_run.patch (subset of superset patch) - Delete vllm_xpu_worker_skip_profile.patch (subset of superset patch) - Add Bug J (eagle_ops dtype) and Bug K (direct-call workarounds) - Fix Bugs A/B/C patch refs (removed — fixed upstream) - Update status table with all 11 bugs (A-K) https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Critical fixes: - transformers>=5.5.0: graceful fallback to 4.x if 5.x not on PyPI - triton: use --force-reinstall instead of uninstall+install to avoid leaving venv without any triton on network failure - MAX_JOBS: default 4 (was 6) to avoid OOM on 32GB Lunar Lake - triton-xpu: unpin version, let pip resolve torch 2.10 compatibility Significant fixes: - Add root guard (EUID check) — sudo ./install would put venv in /root/ - Add GPU access preflight (/dev/dri/renderD128 permission check) - Python version: check >=3.10 AND <=3.12 (was only <=3.12) - sed xpu_worker patch: replace fragile range pattern with Python regex - MKL LD_PRELOAD: add to vllm-v19-activate alias so it persists - torch XPU check: use stable torch.xpu.is_available() API - Document RHEL repo usage on Fedora with verification note https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
32GB swap is configured, so peak 24GB RAM from 6 parallel SYCL compile units won't OOM, just spill to swap. Build stays fast. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Three alignments with vLLM 0.19.0's own requirements/xpu.txt and docs: 1. torch: install torch==2.10.0+xpu via --extra-index-url (matches the exact pin in requirements/xpu.txt). The previous torch==2.10.0 + --index-url worked but is not the official pinning. 2. vllm-xpu-kernels: Phase 4's `pip install -r requirements/xpu.txt` already installs the pre-built wheel vllm_xpu_kernels==0.1.4 (pinned in upstream). The 1.5-2 hour source build of v0.1.5 in Phase 5 is now OPT-IN via VLLM_BUILD_XPU_KERNELS=1. Default install is ~30 min instead of ~2 hours. 3. triton-xpu: pin to 3.6.0 (required for torch 2.10 per v0.19 docs) and switch from --extra-index-url=.../whl/test/xpu (rolling test index) to --extra-index-url=.../whl/xpu (stable). test/ may ship 3.7.x which doesn't pair correctly with torch 2.10. Refs: - https://github.com/vllm-project/vllm/blob/v0.19.0/requirements/xpu.txt - https://github.com/vllm-project/vllm/blob/v0.19.0/docs/getting_started/installation/gpu.xpu.inc.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unar Lake findings Installer fix: - vLLM 0.19.0's requirements/xpu.txt pins vllm_xpu_kernels==0.1.4 which does NOT recognise Lunar Lake as XE2 (is_xe2_arch() only matches bmg_g21, bmg_g31, and pvc; intel_gpu_lnl_m is missing). First inference request fails with "Only XE2 cutlass kernel is supported currently." and kills the engine. - v0.1.5 adds intel_gpu_lnl_m to the allowlist. Pre-built wheel available on GitHub releases — no source build needed. - Phase 4 now force-reinstalls v0.1.5 right after requirements/xpu.txt. Quick-start example updated: - Shows --kv-cache-memory-bytes workaround (v0.19 replacement for our v0.14 VLLM_SKIP_PROFILE_RUN=1 patch — v0.19's MXFP4 profile-run peak busts the 28 GiB Lunar Lake budget even at max_num_seqs=4). - Adds drop_caches / reboot hints for the XPU shared-memory leak that accumulates across crashed launches. Findings doc (issues/vllm-v19-lunar-lake-findings.md) captures: - Exact v0.1.4 vs v0.1.5 diff in csrc/utils.h::is_xe2_arch - Why --kv-cache-memory-bytes is needed on Lunar Lake - XPU shared-memory leak symptom and workaround - What's carried forward from v0.14 vs what's new in v0.19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end verification after v0.1.5 wheel fix (post-reboot):
- GPT-OSS-20B MXFP4 serving on v0.19 works.
- vllm bench serve (single-user, --ignore-eos):
128 in/out: 19.1 tok/s avg, 43 ms TPOT
1024 in/out: 15.2 tok/s avg, 67 ms TPOT
- Steady-state decode ≈ 23 tok/s — on par with v0.14 IPEX
GatedMLPMOE(is_mxfp4=True) baseline of ~22 tok/s. MXFP4 gets no
meaningful speedup on v0.19; real payoff is expected for INT4
AutoRound MoE (not tested yet).
Added openclaw integration config (matches local/gpt-oss-20b provider
on 127.0.0.1:8080, uses --served-model-name so short ids resolve).
Expanded Finding 3 (XPU shared-memory leak):
- Not confined to SIGKILL — Ctrl+C on graceful vllm serve also leaks.
- Observed pattern of degrading XPU free across the session.
- `sync && echo 3 > drop_caches` usually recovers enough to relaunch;
below ~15 GiB free after cache drop indicates a real L0 mapping
leak and only reboot resets it.
Memory budget math at 64k context documented (48 KB/token/layer).
KV-cache-memory-bytes config for single-seq vs two-seq worked out.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kernel
Tonight's finding: vllm_xpu_kernels v0.1.5 has a real, correct INT4 MoE
kernel (cutlass_grouped_gemm_interface with is_B_int4=True). Verified
numerically against a CPU fp32 dequant+matmul reference using Qwen3-VL-
30B-A3B's layer 0 expert 0 gate_proj weights:
rel error: 0.003 (bf16 noise floor)
correlation: 1.000
range: [-3.27, 3.09] vs ref [-3.26, 3.10]
v0.1.4 had `is_B_int4` actually routing to FP4 E2M1 (documented bug in
earlier findings). v0.1.5 separates the two via scale dtype: uint8 B with
bf16/fp16 scales = real INT4; uint8 B with uint8 E8M0 scales = MXFP4.
vLLM 0.19 itself doesn't wire this up — the default GPTQ MoE path calls
CUDA-only torch.ops._C.gptq_shuffle, the AutoRound path raises explicit
NotImplementedError during 'xpu kernel migration', and AWQ also hits the
missing gptq_marlin_repack kernel. RFC #33214 confirms int4_moe_support
has no PR yet.
Patch adds an XPU branch to GPTQMarlinMoEMethod that:
- process_weights_after_loading: GPTQ int32-packed [E, K/8, 2N]
-> uint8 2's-complement [E, 2N, K/2]; scales transposed to
[E, 2N, K/GS]. All done per-expert on CPU to avoid XPU allocator
pressure, original weights freed before new ones materialize.
- apply: skips upstream CUDA kernel, uses topk -> moe_rows_counts ->
moe_scatter -> cutlass_grouped_gemm(is_B_int4=True) x2 with silu_and_mul
-> moe_gather pipeline.
Still to do before this can ship:
- Apply patch to a live v0.19 venv and test Qwen3-30B-a3b-gptq-int4
end-to-end
- Verify the xpu_int4_ready flag is reached before the modular XPUExperts
path takes over
- Handle GPTQ asym mode (desc_act, non-default qzeros)
The isolated-kernel test is the crucial bit. Wiring is the cheap part —
our v0.14 patch a7ee41d followed the same shape against a broken kernel
(v0.1.4 FP4) and got garbled output; against the now-verified v0.1.5
kernel it should finally produce coherent output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First end-to-end success for a non-MXFP4 MoE model on vLLM 0.19 + Intel
Arc 140V. Verified with Qwen3-30B-a3b-gptq-int4:
weights load: 15.56 GiB in 40.1s
decode: 100 tokens in 8.67s = 11.5 tok/s
output: coherent English, no garbling
Prior state on v0.14 + our patches (Python sequential expert loop via
int4_gemm_w4a16): 0.9 tok/s. This is a ~12.8x speedup — the batched
CUTLASS int4 kernel we verified earlier (v0.1.5, rel=0.003, corr=1.000)
now delivers in practice.
Two patches combined:
1. xpu_gptq_awq_linear_int4_pr33662.patch
Upstream PR #33662 ("[XPU][3/N] add int4 gemm support for xpu").
Covers LINEAR gptq/awq layers (q/k/v/o_proj, lm_head, etc.) which
previously tripped on CUDA-only torch.ops._C.gptq_shuffle. Uses
vllm_xpu_kernels v0.1.5's GPTQUtils / transpose_onednn_woq_format
helpers + int4_gemm_w4a16 dispatch.
2. xpu_gptq_moe_int4_cutlass_v19.patch (updated from draft, now a
clean diff against v0.19.0 upstream)
Our MoE half — adds XPU branch to GPTQMarlinMoEMethod:
process_weights_after_loading: int32-packed GPTQ -> uint8
2's-comp [E, 2N, K/2], scales [E, 2N, K/GS]
apply: topk -> moe_rows_counts -> moe_scatter ->
cutlass_grouped_gemm(is_B_int4=True) x2 + silu_and_mul ->
moe_gather
PR #33662 alone cannot load a GPTQ MoE model (linear layers prep works
but MoE hits gptq_shuffle). Our MoE patch alone can't boot a model
whose linear layers also use GPTQ (same trip point at load). Both
required.
Remaining work:
- AutoRound routing (Bug-A-style) to send XPU FusedMoE to
GPTQMarlinMoEMethod, bypassing 'INC not supported during xpu
kernel migration'. Unlocks Qwen3-VL-30B, GLM-4.7-Flash, Qwen3.5.
- AWQ MoE (analog of our GPTQ MoE patch).
- GPTQ asym (desc_act) qzeros handling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…of INC stub
AutoRound's auto_round:auto_gptq and auto_round:auto_awq packing formats
are bit-compatible with plain GPTQ and AWQ. INCConfig.apply_gptq_quant_layer
already constructs GPTQMarlinMoEMethod / GPTQMarlinLinearMethod for these
models — the only bug is get_quant_method short-circuiting XPU to
apply_ipex_quant_layer which raises NotImplementedError during the XPU
kernel migration.
Patch adds a 10-line XPU branch in get_quant_method: route AutoRound to
apply_gptq_quant_layer or apply_awq_quant_layer directly, before the
legacy IPEX dispatch. Both destinations have XPU-aware apply() now:
- Linear layers: via PR #33662 (int4_gemm_w4a16)
- GPTQ MoE: via our xpu_gptq_moe_int4_cutlass_v19.patch
(cutlass_grouped_gemm_interface with is_B_int4=True)
End-to-end test with Qwen3-VL-30B-A3B blocked by XPU shared-memory leak
from the successful Qwen3-30B GPTQ run (preflight reported 0.75 GiB free
— needs reboot to clear). Syntax verified; routing logic is a direct
mirror of v0.14's autoround_fusedmoe_ipex_routing.patch (Bug A) which
went through the same path.
Three patches now compose the XPU INT4 MoE stack on v0.19:
1. xpu_gptq_awq_linear_int4_pr33662.patch (upstream PR #33662)
2. xpu_gptq_moe_int4_cutlass_v19.patch (our MoE add)
3. xpu_autoround_route_to_gptq_awq_v19.patch (this — routing shim)
AWQ MoE (qwen3-coder-30b-a3b-awq, glm-4.7-flash-awq-4bit) still not
covered — would need a fourth patch mirroring our GPTQ MoE logic against
AwqMoEMethod.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First-ever coherent output for this model on Lunar Lake. Prior v0.14 stack produced silent zero-output (garbled tokens) because the IPEX int4 kernel was FP4-aliased. v0.19 + four-patch stack uses the real CUTLASS int4 path (verified rel=0.003, corr=1.000 against CPU fp32 ref) and gets both correctness AND a 10x speedup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EX ops Prior MoE patch called torch.ops.torch_ipex.moe_rows_counts / moe_scatter / moe_gather / silu_and_mul — those live in intel_extension_for_pytorch, which is archived and not installed in v0.19. The earlier 11.5 tok/s Qwen3-30B GPTQ claim was an artifact of v0.14 IPEX residual in the test environment. In v0.19, vllm_xpu_kernels ships a complete xpu_fused_moe() entry point (fused_moe_interface.py) that handles routing + grouped gemm + activation + gather, using _moe_C / _xpu_C / _C namespaces. It also applies the zp=8 transform internally via implement_zp(), so weights are packed as raw u4 nibbles (no -8 subtract at pack time). Also add compressed_tensors (AWQ/pack-quantized) MoE patch that mirrors the GPTQ path. Verified on Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit on Lunar Lake: coherent output at ~22 tok/s decode.
Bench recipe and results for 1024-in / 1024-out via `vllm bench serve`. Also documents the IPEX-ops bug in the earlier MoE patch (archived IPEX ops not installed in v0.19), the fix to use vllm_xpu_kernels xpu_fused_moe() directly, and flags prior 11.5/9.3 tok/s claims as pending re-verification against commit f47f4a2.
GPT-OSS-20B launcher moved to 8081 so port 8080 can host Qwen3-Coder-30B AWQ as the default local model. openclaw config routes `local/qwen3-coder` (primary) to 8080 and `local-gptoss/gpt-oss-20b` (fallback) to 8081.
Same bench recipe (1024/1024/3) across all three Qwen3-30B-A3B INT4 variants using the fixed xpu_fused_moe() path: - qwen3-coder-awq (group_size=32): 18.22 tok/s - qwen3-30b-gptq (group_size=128): 9.97 tok/s - qwen3-vl-autoround (group_size=128): 9.68 tok/s Same architecture, same kernel, same patches. group_size=32 path is ~1.83x faster than group_size=128 in vllm_xpu_kernels v0.1.5. GPTQ and AutoRound land within 3% of each other, confirming AutoRound->GPTQ XPU routing exposes the same code path.
LNL validation data for Intel's compressed-tensors W4A16 MoE draft PR: Qwen3-Coder-30B AWQ at 18.22 tok/s, plus group_size=32 vs 128 observation and LNL-specific install gotchas. Draft only — not posted yet.
Opt-in VLLM_XPU_UNIFIED_MEMORY env var for Lunar Lake / Arrow Lake / Panther Lake iGPUs. Draft only — post from laptop when ready.
…B ceiling New findings: - mem_get_info.free underreports on unified memory; patched mem_utils.py to use psutil.virtual_memory().available on XPU (mirrors existing CUDA UMA branch). Upstream PR branch at MegaStood/vllm:xpu-integrated-gpu-mem uses opt-in VLLM_XPU_UNIFIED_MEMORY env var (safer than memory-ratio heuristic for PVC-class false positives). - v0.19.0 regression: unify_kv_cache_spec_page_size() hard-fails on hybrid attention+Mamba models (issue #38979, not in v0.19.1). Cherry- picked PR #40128's LCM-based padding fix onto kv_cache_utils.py. - Hard ceiling on Lunar Lake: ~21 GiB model weights on 30.9 GiB system. AutoRound fits (Qwen3.5-35B at 21 GiB = 8.94 tok/s); GPTQ at 23 GiB does not fit by physics (util must be 0.91, free check fails). - Added GLM-4.7-Flash (AWQ 6.58 / AR 4.71 tok/s) and Qwen3.5-35B AR (8.94 tok/s) to the benchmark matrix. Confirms gs=32 ~1.8x speedup and GQA dominates KV bandwidth. Outstanding: capture in-venv patches as .patch files, open upstream PR.
…rry-pick Two new patch files for v0.19.0: 1. xpu_unified_memory_free_check_v19.patch (13 lines) Adds `elif current_platform.is_xpu()` branch to MemorySnapshot.measure() that uses psutil.virtual_memory().available instead of mem_get_info.free. Mirrors the existing Orin/Thor/Spark CUDA UMA branch. Fixes the pre-init free-memory check on Lunar Lake where page cache from reading safetensors makes mem_get_info.free underreport by ~5-10 GiB. 2. xpu_hybrid_moe_page_size_pr40128.patch (89 lines) Cherry-picks PR #40128's LCM-based padding fix onto v0.19.0's unify_kv_cache_spec_page_size(). Required to run Qwen3.5-35B-A3B AutoRound (hybrid attention+Mamba architecture) which trips issue #38979 otherwise. Both verified with Qwen3.5-35B AutoRound loading successfully and benching at 8.94 tok/s on Arc 140V / Lunar Lake.
Validates Sandermage's hybrid KV cache page-size fix on a different hybrid architecture (Qwen3.5-35B-A3B-AutoRound on Lunar Lake iGPU). Draft only — post from laptop when ready.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Files
issues/glm4-mla-xpu-bugs.md— 3-bug MLA issue writeupissues/glm4_moe_lite_int4_xpu_marlin_shuffle.md— MoE OOM investigationissues/vllm-30359-comment.md— upstream vLLM issue comment draftscripts/fix_glm4_mla.sh— auto-fix scriptvllm/patches/glm4_moe_lite_mla_xpu.patch— unified patch