docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU by MegaStood · Pull Request #334 · intel/llm-scaler

MegaStood · 2026-03-25T13:31:34Z

Summary

3-fix patch to enable MLA (Multi-head Latent Attention) for GLM-4.7-Flash on XPU: whitelist fix, TRITON_MLA routing, XPU flash_attn import
MLA reduces KV cache 17.5x (3.67 GiB → 0.21 GiB for 4096 tokens)
MoE marlin_shuffle_weight OOM investigation: 5 approaches tested, all blocked by 32GB shared memory limitation
Auto-fix script and unified patch file included

Files

issues/glm4-mla-xpu-bugs.md — 3-bug MLA issue writeup
issues/glm4_moe_lite_int4_xpu_marlin_shuffle.md — MoE OOM investigation
issues/vllm-30359-comment.md — upstream vLLM issue comment draft
scripts/fix_glm4_mla.sh — auto-fix script
vllm/patches/glm4_moe_lite_mla_xpu.patch — unified patch

Documents incompatibility — this project targets discrete Arc Pro B60 GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

SYCL ESIMD kernels are fully portable to Xe2 (no hardcoded device IDs or BMG-specific constants). This adds the infrastructure adaptations: - Dockerfile.lunar-lake: lightweight single-GPU image for iGPU - lunar_lake_serve.sh: memory-aware vLLM launch with shared memory config - lunar_lake_evaluation.sh: iGPU platform evaluation (skip P2P/CCL) - platform_basic_evaluation.sh: detect Lunar Lake iGPU alongside B60 - LUNAR_LAKE_COMPATIBILITY.md: full docs with quick start and model recs Key settings for Lunar Lake: TP=1, CCL_TOPO_P2P_ACCESS=0, gpu-memory-utilization=0.7, enforce-eager, INT4 quantization recommended. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

One-command installer that handles: - oneAPI Base Toolkit via DNF - Level-Zero for Xe2 iGPU - PyTorch XPU in a venv - Patched vLLM build from source - vllm-xpu-kernels + triton-xpu - Bash aliases for quick launch (vllm-serve) No Docker required. Targets Nobara 43 / Fedora 42+. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Level-Zero packages aren't in Nobara/Fedora default repos. Now: - Adds Intel oneAPI + compute-runtime repos first - Tries multiple Level-Zero package names (varies across distros) - Uses --skip-unavailable to handle missing packages gracefully - Falls back to checking if libze_loader exists from xe driver https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Two issues on Nobara 43: 1. oneAPI setvars.sh hangs when MPI probes network interfaces over SSH. Fix: set ONEAPI_SETVARS_MPI_INSTALL=0 before sourcing. 2. Nobara 43 ships Python 3.14 but PyTorch XPU only has wheels for <=3.12. Fix: detect version, auto-install python3.12 via dnf if needed. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

setvars.sh has unbound variables and non-zero exits internally that trigger our set -euo pipefail, silently killing the script at Phase 2. Fix: temporarily set +euo pipefail around the source call, pipe output through grep to show only component init lines, then restore strict mode. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

The pipe operator causes `source` to run in a subshell, so all oneAPI environment variables were lost in the parent shell. Redirect to a temp file instead and grep it separately. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

The +xpu version suffix doesn't exist on PyPI. XPU wheels are served from a separate index URL. Also remove tail pipe so download progress is visible. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Use exact version pins from official Intel XPU install docs to avoid pulling incompatible versions. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

PyTorch XPU needs libze_intel_gpu.so (the GPU userspace driver) to talk to the xe kernel driver via Level-Zero. This is provided by the intel-compute-runtime package, which was missing from the install list. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

All the | tail -N pipes were hiding pip install and build output, making the script appear frozen during long operations. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Running from ~/llm-scaler which has a vllm/ subdirectory causes 'import vllm' to succeed even when vLLM is not installed, skipping the entire build phase. Use 'pip show vllm' instead. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Show full build output so errors are visible. Also clean up stale vllm-xpu-kernels directory if previous build failed. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Preload oneAPI MKL libraries (LD_PRELOAD) in both serve and install scripts to fix "Cannot load libmkl_core.so.2" caused by PyTorch's broken relative RPATH in venvs. Set MAX_JOBS=2 for xpu-kernels build to prevent OOM kills on 32GB shared-memory systems. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

- Change MAX_JOBS from 2 to 6 (75% CPU, ~24GB peak RAM, safe for 32GB) - Make MAX_JOBS overridable via environment variable - Add log message warning that 933 SYCL files take 1.5-2 hours on Lunar Lake - Add comment reminding users to plug in before building https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

…unar Lake Real-world testing on MSI Claw 8 AI+ (Arc 140V, 32GB shared) revealed: - oneCCL KVS init fails on WiFi-only devices ("can't find non-loopback interface") Added CCL_ZE_ENABLE=0, CCL_ATL_TRANSPORT=ofi, FI_PROVIDER=tcp workaround - --device xpu is not a valid vLLM CLI flag; device must be set via VLLM_TARGET_DEVICE=xpu environment variable - xpu_worker.py all_reduce warmup must be patched out for single-GPU (TP=1) - Pre-quantized AutoRound/GPTQ 35B models OOM during weight loading on 32GB shared memory (peak memory doubles due to INT4→FP16 unpacking) - vllm-xpu-kernels build takes 1.5-2 hours on Lunar Lake (933 SYCL files) - GPU crash (DEVICE_LOST) after OOM requires full system reboot Updated: install_lunar_lake.sh, lunar_lake_serve.sh, Dockerfile.lunar-lake, LUNAR_LAKE_COMPATIBILITY.md https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

…g results Add detailed model compatibility matrix based on real testing on MSI Claw 8 (Core Ultra 7 258V, Arc 140V). Key findings: - Triton XPU backend broken on Xe2 (blocks Qwen3.5 fla/linear attention) - Marlin kernels CUDA-only (blocks AWQ/GPTQ compressed-tensors) - Pre-quantized INT4 doubles peak memory (blocks >14B models on 32GB) - Only FP16 base + online quantization works reliably https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…timeouts) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Tested Intel/Qwen3-8B-int4-AutoRound on MSI Claw 8 (32GB LPDDR5x): - Single-request: 17.6 tok/s generation - Batched peak: 90 tok/s (5 concurrent short requests) - Long context (4K+2K): 50.2 tok/s output, 150.5 tok/s total - Model loads in 7s using only 5.69 GiB https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

…ility-CB5w6 Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results

…y 5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documented working setup for Qwen3-TTS-12Hz-1.7B-Base on Arc 140V: - Venv creation with shared XPU PyTorch from vLLM install - transformers==4.57.3 pinning (newer versions break qwen-tts) - Voice cloning example with generate_voice_clone() API - Tested and confirmed working on MSI Claw 8 AI+ https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

- Add consolidated "Running Recipes" section with LLM/ASR/TTS commands and memory budget table - Add Qwen3-ASR-1.7B setup, serve, and test documentation - Use 127.0.0.1 (localhost) since OpenClaw accesses services locally - ASR uses 0.25 GPU utilization (~7.2GB) for efficient memory usage https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Qwen3-30B-A3B GPTQ INT4 routes to IPEX (bypasses CUDA Marlin) but OOMs during MoE expert weight shuffle at 15.7 GiB — GPU enters DEVICE_LOST state. Confirms 30B MoE models don't fit on 32GB shared memory. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

- New install script install_meteor_arrow_lake.sh supporting: - Meteor Lake Xe-LPG (Core Ultra 155H/135H): PCI 7d55, 7dd5, 7d40, 7d45 - Arrow Lake Xe-LPG+ (Core Ultra 255H/245H): PCI 7d51, 7dd1, 7d41, 7d67 - Auto-detects platform and GPU via PCI device IDs - Warns Meteor Lake users to switch from i915 to xe driver - Adjusts memory recommendations based on system RAM (16-96GB) - Updated evaluation script detect_gpu() for all three platforms - Added platform compatibility table to LUNAR_LAKE_COMPATIBILITY.md https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

… transformers 5.x catch-22 - Add server-side engine log observations: generation throughput decay with context length, prefix cache hit rates, KV cache usage per concurrent request - Add Qwen3.5-4B AutoRound failure: multimodal model hits transformers 5.x Qwen2VLImageProcessor.max_pixels API break - Add LFM2-24B-A2B AWQ failure: custom Liquid AI tokenizer unsupported - Document transformers version catch-22: 4.x can't recognize new architectures, 5.x breaks vLLM multimodal code - Add critical blocker: transformers 5.x vs vLLM mismatch for multimodal models https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

…rc.patch Intel's llm-scaler Docker image uses transformers from git HEAD (5.x) with vllm_for_multi_arc.patch applied, which adds full Qwen3.5 architecture support (min_transformers_version="5.1.0"). The max_pixels AttributeError is caused by transformers 5.x renaming image_processor.max_pixels to size["longest_edge"]. Added one-line getattr() fix for native installs without the full patch. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Confirmed that Qwen3.5-4B AutoRound INT4 successfully loads (3.68 GiB — well within Lunar Lake's 28.6 GB budget) after applying the max_pixels getattr() fix, but crashes during warmup on the Triton kernel in fla/ops/layernorm_guard.py. Even Intel's forward_xpu code path for Qwen3.5 routes through fla/ops layernorm which requires @triton.jit. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

CUTLASS INT4 packs nibbles from different output columns in each byte, not sequential nibbles along the input dimension like GPTQ. Conversion needs full layout transform, not just int32→uint8 repack. Inference pipeline works (no crash) but output is garbled due to this. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Full patch collection for running INT4 GPTQ/AutoRound MoE models on Intel Lunar Lake Xe2 iGPU with vLLM 0.14 + vllm-xpu-kernels. New patches: - xpu_gptq_moe_cutlass_sequential.patch: CUTLASS sequential expert loop bypassing broken IPEX GatedMLPMOE (BREAKTHROUGH fix) - eagle_ops_dtype_check.patch: skip ESIMD decode for non-FP16 query - ipex_attention_dtype_cast.patch: cast k/v to match q dtype - xpu_attention_direct_call.patch: bypass torch.ops.vllm attention - xpu_moe_direct_call.patch: bypass torch.ops.vllm MoE dispatch - xpu_worker_skip_profile_and_warmup.patch: skip profile + warmup Updated patches: - ipex_marlin_shuffle_skip_preshuffled.patch: full IPEX __init__ bypass - mla_xpu_return_attn_probs.patch: regenerated from current source Removed (already in source tree): - autoround_fusedmoe_ipex_routing.patch (Bug A) - gemma4_moe_top_k_experts.patch (Bug B) - gptq_math_ceil_alignment.patch (Bug C) Status: Inference pipeline completes without GPU crash. Output is garbled due to GPTQ→CUTLASS INT4 weight format mismatch (column- interleaved vs row-packed nibbles). Weight conversion fix is next. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ASS INT4 kernel The _gptq_int32_to_uint8 conversion was missing a K<->N transpose step. GPTQ stores qweight as [E, K/8, N] int32 (K-major), but CUTLASS grouped GEMM expects [E, N, K/2] uint8 (N-major). The old implementation only repacked nibbles along the existing K-major axis, producing [E, K/2, N] which caused DEVICE_LOST or garbage output with real model weights while appearing to work with zeros/random data. New implementation: unpack int32 -> [E, K, N] int8, transpose to [E, N, K], then repack nibble pairs (low = even-K, high = odd-K) -> [E, N, K/2] uint8. Layout now matches vllm-xpu-kernels docstring in fused_moe_interface.py. Documented symmetric-ZP constraint (assumes sym=True — AutoRound default). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Adds a second backend for INT4 MoE selectable via VLLM_XPU_MOE_BACKEND: - "cutlass" (default): existing chunked grouped GEMM path - "onednn": per-expert int4_gemm_w4a16 via torch.ops._xpu_C The oneDNN path uses GPTQ's native int32 [K/8, N] layout — no uint8 conversion, no shuffling, no implement_zp. Weights just need K-contiguous strides (via transpose trick) and a scalar zero-point of 8 for symmetric quantization. Slower than chunked CUTLASS (one kernel call per expert per gemm) but useful as a fallback when CUTLASS has issues on specific model shapes. Both paths still bypass the broken IPEX GatedMLPMOE (Bug H / IPEX #838). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

… backend - Explains the missing K<->N transpose that caused DEVICE_LOST with real model weights (worked with zeros/random due to symmetric data). - Documents the new VLLM_XPU_MOE_BACKEND=onednn alternative path using int4_gemm_w4a16 with native GPTQ int32 layout as a fallback. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Correct weight conversion using int4_gemm_w4a16 (oneDNN format): - No marlin_shuffle, no implement_zp, no uint8 conversion - GPTQ int32 weights used directly with K-contiguous strides - Scalar zero_point=8 for symmetric GPTQ - 0.999997 correlation with CPU reference dequant Inference produces real Chinese text (not garbled), 50 tokens generated, HTTP 200 OK. Performance is 0.16 tok/s (sequential expert loop — optimization needed). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Optimized apply(): pre-compute active experts via .unique(), use index_add_() for scatter-accumulate, fix dtype cast. 315s → 41s for 30 tokens. Generation: 0.16 → 0.9 tok/s. Kernel benchmarks show 30μs per int4_gemm_w4a16 call — the bottleneck is now Python loop overhead (88% of total time). Updated xpu_gptq_moe_int4_w4a16.patch with optimized apply(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Confirmed: GPT-OSS-20B runs via GatedMLPMOE(is_mxfp4=True) on IPEX marlin backend. Same GatedMLPMOE crashes with is_int4=True but works with is_mxfp4=True — the bug is in IPEX's INT4 kernel path. Benchmark: MXFP4 gets 3.0 tok/s (batched), INT4 gets 0.9 tok/s (sequential). ~6.5x faster per layer due to grouped GEMM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Both INT4 and MXFP4 CUTLASS grouped GEMM crash with DEVICE_LOST at 16+ active expert groups. IPEX moe_gemm handles 32 experts fine. The bug is in CUTLASS's Level Zero dispatch, not data format. IPEX INT4 crashes for a separate reason (kernel bug in INT4 dequant). GPT-OSS-20B benchmark: 5.4 tok/s gen, 14.2 tok/s prompt via IPEX. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…1 not real INT4 cutlass_grouped_gemm_interface(is_B_int4=True) treats nibbles as FP4 E2M1 microscaling format (2 exponent + 1 mantissa bit), NOT integer 4-bit. Value 15 maps to NaN, other values are exponentially wrong. The entire CUTLASS INT4 path (implement_zp, uint8 repack, chunked dispatch) produces garbage output with real GPTQ model weights. int4_gemm_w4a16 (oneDNN) is the correct kernel — reads native GPTQ int32 weights and does proper (nibble - zp) * scale dequantization. 30μs/call, 2x faster than CUTLASS, 40x faster than BF16 at M=1. Changes: - Default VLLM_XPU_MOE_BACKEND switched from "cutlass" to "onednn" - CUTLASS INT4 path marked as broken in code comments - Bug H docs rewritten with FP4 vs INT4 comparison table, correct kernel identification, and throughput analysis https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

…ke-compatibility-CB5w6 # Conflicts: # vllm/patches/vllm_for_multi_arc.patch

Bug H has two independent failures: 1. GatedMLPMOE crash (SOLVED) — topk_softmax + wrapper INT4 path, bypassed with torch.topk + direct IPEX ops 2. group_mm_int4_out_marlin returns zeros/NaN on Xe2 (UNSOLVED) — kernel broken, blocks batched path. 4-layout test proves no weight layout fix possible. Correct old "Level Zero context pollution" theory — crash was in GatedMLPMOE/topk_softmax, not attention corrupting driver state. Patch cleanup: - Delete xpu_gptq_moe_cpu_shuffle.patch (superseded by int4_w4a16) - Delete xpu_gptq_moe_cutlass_sequential.patch (CUTLASS is FP4 E2M1) - Delete xpu_skip_warmup_dummy_run.patch (subset of superset patch) - Delete vllm_xpu_worker_skip_profile.patch (subset of superset patch) - Add Bug J (eagle_ops dtype) and Bug K (direct-call workarounds) - Fix Bugs A/B/C patch refs (removed — fixed upstream) - Update status table with all 11 bugs (A-K) https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Critical fixes: - transformers>=5.5.0: graceful fallback to 4.x if 5.x not on PyPI - triton: use --force-reinstall instead of uninstall+install to avoid leaving venv without any triton on network failure - MAX_JOBS: default 4 (was 6) to avoid OOM on 32GB Lunar Lake - triton-xpu: unpin version, let pip resolve torch 2.10 compatibility Significant fixes: - Add root guard (EUID check) — sudo ./install would put venv in /root/ - Add GPU access preflight (/dev/dri/renderD128 permission check) - Python version: check >=3.10 AND <=3.12 (was only <=3.12) - sed xpu_worker patch: replace fragile range pattern with Python regex - MKL LD_PRELOAD: add to vllm-v19-activate alias so it persists - torch XPU check: use stable torch.xpu.is_available() API - Document RHEL repo usage on Fedora with verification note https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

32GB swap is configured, so peak 24GB RAM from 6 parallel SYCL compile units won't OOM, just spill to swap. Build stays fast. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Three alignments with vLLM 0.19.0's own requirements/xpu.txt and docs: 1. torch: install torch==2.10.0+xpu via --extra-index-url (matches the exact pin in requirements/xpu.txt). The previous torch==2.10.0 + --index-url worked but is not the official pinning. 2. vllm-xpu-kernels: Phase 4's `pip install -r requirements/xpu.txt` already installs the pre-built wheel vllm_xpu_kernels==0.1.4 (pinned in upstream). The 1.5-2 hour source build of v0.1.5 in Phase 5 is now OPT-IN via VLLM_BUILD_XPU_KERNELS=1. Default install is ~30 min instead of ~2 hours. 3. triton-xpu: pin to 3.6.0 (required for torch 2.10 per v0.19 docs) and switch from --extra-index-url=.../whl/test/xpu (rolling test index) to --extra-index-url=.../whl/xpu (stable). test/ may ship 3.7.x which doesn't pair correctly with torch 2.10. Refs: - https://github.com/vllm-project/vllm/blob/v0.19.0/requirements/xpu.txt - https://github.com/vllm-project/vllm/blob/v0.19.0/docs/getting_started/installation/gpu.xpu.inc.md Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…unar Lake findings Installer fix: - vLLM 0.19.0's requirements/xpu.txt pins vllm_xpu_kernels==0.1.4 which does NOT recognise Lunar Lake as XE2 (is_xe2_arch() only matches bmg_g21, bmg_g31, and pvc; intel_gpu_lnl_m is missing). First inference request fails with "Only XE2 cutlass kernel is supported currently." and kills the engine. - v0.1.5 adds intel_gpu_lnl_m to the allowlist. Pre-built wheel available on GitHub releases — no source build needed. - Phase 4 now force-reinstalls v0.1.5 right after requirements/xpu.txt. Quick-start example updated: - Shows --kv-cache-memory-bytes workaround (v0.19 replacement for our v0.14 VLLM_SKIP_PROFILE_RUN=1 patch — v0.19's MXFP4 profile-run peak busts the 28 GiB Lunar Lake budget even at max_num_seqs=4). - Adds drop_caches / reboot hints for the XPU shared-memory leak that accumulates across crashed launches. Findings doc (issues/vllm-v19-lunar-lake-findings.md) captures: - Exact v0.1.4 vs v0.1.5 diff in csrc/utils.h::is_xe2_arch - Why --kv-cache-memory-bytes is needed on Lunar Lake - XPU shared-memory leak symptom and workaround - What's carried forward from v0.14 vs what's new in v0.19 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

End-to-end verification after v0.1.5 wheel fix (post-reboot): - GPT-OSS-20B MXFP4 serving on v0.19 works. - vllm bench serve (single-user, --ignore-eos): 128 in/out: 19.1 tok/s avg, 43 ms TPOT 1024 in/out: 15.2 tok/s avg, 67 ms TPOT - Steady-state decode ≈ 23 tok/s — on par with v0.14 IPEX GatedMLPMOE(is_mxfp4=True) baseline of ~22 tok/s. MXFP4 gets no meaningful speedup on v0.19; real payoff is expected for INT4 AutoRound MoE (not tested yet). Added openclaw integration config (matches local/gpt-oss-20b provider on 127.0.0.1:8080, uses --served-model-name so short ids resolve). Expanded Finding 3 (XPU shared-memory leak): - Not confined to SIGKILL — Ctrl+C on graceful vllm serve also leaks. - Observed pattern of degrading XPU free across the session. - `sync && echo 3 > drop_caches` usually recovers enough to relaunch; below ~15 GiB free after cache drop indicates a real L0 mapping leak and only reboot resets it. Memory budget math at 64k context documented (48 KB/token/layer). KV-cache-memory-bytes config for single-seq vs two-seq worked out. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…kernel Tonight's finding: vllm_xpu_kernels v0.1.5 has a real, correct INT4 MoE kernel (cutlass_grouped_gemm_interface with is_B_int4=True). Verified numerically against a CPU fp32 dequant+matmul reference using Qwen3-VL- 30B-A3B's layer 0 expert 0 gate_proj weights: rel error: 0.003 (bf16 noise floor) correlation: 1.000 range: [-3.27, 3.09] vs ref [-3.26, 3.10] v0.1.4 had `is_B_int4` actually routing to FP4 E2M1 (documented bug in earlier findings). v0.1.5 separates the two via scale dtype: uint8 B with bf16/fp16 scales = real INT4; uint8 B with uint8 E8M0 scales = MXFP4. vLLM 0.19 itself doesn't wire this up — the default GPTQ MoE path calls CUDA-only torch.ops._C.gptq_shuffle, the AutoRound path raises explicit NotImplementedError during 'xpu kernel migration', and AWQ also hits the missing gptq_marlin_repack kernel. RFC #33214 confirms int4_moe_support has no PR yet. Patch adds an XPU branch to GPTQMarlinMoEMethod that: - process_weights_after_loading: GPTQ int32-packed [E, K/8, 2N] -> uint8 2's-complement [E, 2N, K/2]; scales transposed to [E, 2N, K/GS]. All done per-expert on CPU to avoid XPU allocator pressure, original weights freed before new ones materialize. - apply: skips upstream CUDA kernel, uses topk -> moe_rows_counts -> moe_scatter -> cutlass_grouped_gemm(is_B_int4=True) x2 with silu_and_mul -> moe_gather pipeline. Still to do before this can ship: - Apply patch to a live v0.19 venv and test Qwen3-30B-a3b-gptq-int4 end-to-end - Verify the xpu_int4_ready flag is reached before the modular XPUExperts path takes over - Handle GPTQ asym mode (desc_act, non-default qzeros) The isolated-kernel test is the crucial bit. Wiring is the cheap part — our v0.14 patch a7ee41d followed the same shape against a broken kernel (v0.1.4 FP4) and got garbled output; against the now-verified v0.1.5 kernel it should finally produce coherent output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First end-to-end success for a non-MXFP4 MoE model on vLLM 0.19 + Intel Arc 140V. Verified with Qwen3-30B-a3b-gptq-int4: weights load: 15.56 GiB in 40.1s decode: 100 tokens in 8.67s = 11.5 tok/s output: coherent English, no garbling Prior state on v0.14 + our patches (Python sequential expert loop via int4_gemm_w4a16): 0.9 tok/s. This is a ~12.8x speedup — the batched CUTLASS int4 kernel we verified earlier (v0.1.5, rel=0.003, corr=1.000) now delivers in practice. Two patches combined: 1. xpu_gptq_awq_linear_int4_pr33662.patch Upstream PR #33662 ("[XPU][3/N] add int4 gemm support for xpu"). Covers LINEAR gptq/awq layers (q/k/v/o_proj, lm_head, etc.) which previously tripped on CUDA-only torch.ops._C.gptq_shuffle. Uses vllm_xpu_kernels v0.1.5's GPTQUtils / transpose_onednn_woq_format helpers + int4_gemm_w4a16 dispatch. 2. xpu_gptq_moe_int4_cutlass_v19.patch (updated from draft, now a clean diff against v0.19.0 upstream) Our MoE half — adds XPU branch to GPTQMarlinMoEMethod: process_weights_after_loading: int32-packed GPTQ -> uint8 2's-comp [E, 2N, K/2], scales [E, 2N, K/GS] apply: topk -> moe_rows_counts -> moe_scatter -> cutlass_grouped_gemm(is_B_int4=True) x2 + silu_and_mul -> moe_gather PR #33662 alone cannot load a GPTQ MoE model (linear layers prep works but MoE hits gptq_shuffle). Our MoE patch alone can't boot a model whose linear layers also use GPTQ (same trip point at load). Both required. Remaining work: - AutoRound routing (Bug-A-style) to send XPU FusedMoE to GPTQMarlinMoEMethod, bypassing 'INC not supported during xpu kernel migration'. Unlocks Qwen3-VL-30B, GLM-4.7-Flash, Qwen3.5. - AWQ MoE (analog of our GPTQ MoE patch). - GPTQ asym (desc_act) qzeros handling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…of INC stub AutoRound's auto_round:auto_gptq and auto_round:auto_awq packing formats are bit-compatible with plain GPTQ and AWQ. INCConfig.apply_gptq_quant_layer already constructs GPTQMarlinMoEMethod / GPTQMarlinLinearMethod for these models — the only bug is get_quant_method short-circuiting XPU to apply_ipex_quant_layer which raises NotImplementedError during the XPU kernel migration. Patch adds a 10-line XPU branch in get_quant_method: route AutoRound to apply_gptq_quant_layer or apply_awq_quant_layer directly, before the legacy IPEX dispatch. Both destinations have XPU-aware apply() now: - Linear layers: via PR #33662 (int4_gemm_w4a16) - GPTQ MoE: via our xpu_gptq_moe_int4_cutlass_v19.patch (cutlass_grouped_gemm_interface with is_B_int4=True) End-to-end test with Qwen3-VL-30B-A3B blocked by XPU shared-memory leak from the successful Qwen3-30B GPTQ run (preflight reported 0.75 GiB free — needs reboot to clear). Syntax verified; routing logic is a direct mirror of v0.14's autoround_fusedmoe_ipex_routing.patch (Bug A) which went through the same path. Three patches now compose the XPU INT4 MoE stack on v0.19: 1. xpu_gptq_awq_linear_int4_pr33662.patch (upstream PR #33662) 2. xpu_gptq_moe_int4_cutlass_v19.patch (our MoE add) 3. xpu_autoround_route_to_gptq_awq_v19.patch (this — routing shim) AWQ MoE (qwen3-coder-30b-a3b-awq, glm-4.7-flash-awq-4bit) still not covered — would need a fourth patch mirroring our GPTQ MoE logic against AwqMoEMethod. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

First-ever coherent output for this model on Lunar Lake. Prior v0.14 stack produced silent zero-output (garbled tokens) because the IPEX int4 kernel was FP4-aliased. v0.19 + four-patch stack uses the real CUTLASS int4 path (verified rel=0.003, corr=1.000 against CPU fp32 ref) and gets both correctness AND a 10x speedup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…EX ops Prior MoE patch called torch.ops.torch_ipex.moe_rows_counts / moe_scatter / moe_gather / silu_and_mul — those live in intel_extension_for_pytorch, which is archived and not installed in v0.19. The earlier 11.5 tok/s Qwen3-30B GPTQ claim was an artifact of v0.14 IPEX residual in the test environment. In v0.19, vllm_xpu_kernels ships a complete xpu_fused_moe() entry point (fused_moe_interface.py) that handles routing + grouped gemm + activation + gather, using _moe_C / _xpu_C / _C namespaces. It also applies the zp=8 transform internally via implement_zp(), so weights are packed as raw u4 nibbles (no -8 subtract at pack time). Also add compressed_tensors (AWQ/pack-quantized) MoE patch that mirrors the GPTQ path. Verified on Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit on Lunar Lake: coherent output at ~22 tok/s decode.

Bench recipe and results for 1024-in / 1024-out via `vllm bench serve`. Also documents the IPEX-ops bug in the earlier MoE patch (archived IPEX ops not installed in v0.19), the fix to use vllm_xpu_kernels xpu_fused_moe() directly, and flags prior 11.5/9.3 tok/s claims as pending re-verification against commit f47f4a2.

GPT-OSS-20B launcher moved to 8081 so port 8080 can host Qwen3-Coder-30B AWQ as the default local model. openclaw config routes `local/qwen3-coder` (primary) to 8080 and `local-gptoss/gpt-oss-20b` (fallback) to 8081.

Same bench recipe (1024/1024/3) across all three Qwen3-30B-A3B INT4 variants using the fixed xpu_fused_moe() path: - qwen3-coder-awq (group_size=32): 18.22 tok/s - qwen3-30b-gptq (group_size=128): 9.97 tok/s - qwen3-vl-autoround (group_size=128): 9.68 tok/s Same architecture, same kernel, same patches. group_size=32 path is ~1.83x faster than group_size=128 in vllm_xpu_kernels v0.1.5. GPTQ and AutoRound land within 3% of each other, confirming AutoRound->GPTQ XPU routing exposes the same code path.

LNL validation data for Intel's compressed-tensors W4A16 MoE draft PR: Qwen3-Coder-30B AWQ at 18.22 tok/s, plus group_size=32 vs 128 observation and LNL-specific install gotchas. Draft only — not posted yet.

Opt-in VLLM_XPU_UNIFIED_MEMORY env var for Lunar Lake / Arrow Lake / Panther Lake iGPUs. Draft only — post from laptop when ready.

…B ceiling New findings: - mem_get_info.free underreports on unified memory; patched mem_utils.py to use psutil.virtual_memory().available on XPU (mirrors existing CUDA UMA branch). Upstream PR branch at MegaStood/vllm:xpu-integrated-gpu-mem uses opt-in VLLM_XPU_UNIFIED_MEMORY env var (safer than memory-ratio heuristic for PVC-class false positives). - v0.19.0 regression: unify_kv_cache_spec_page_size() hard-fails on hybrid attention+Mamba models (issue #38979, not in v0.19.1). Cherry- picked PR #40128's LCM-based padding fix onto kv_cache_utils.py. - Hard ceiling on Lunar Lake: ~21 GiB model weights on 30.9 GiB system. AutoRound fits (Qwen3.5-35B at 21 GiB = 8.94 tok/s); GPTQ at 23 GiB does not fit by physics (util must be 0.91, free check fails). - Added GLM-4.7-Flash (AWQ 6.58 / AR 4.71 tok/s) and Qwen3.5-35B AR (8.94 tok/s) to the benchmark matrix. Confirms gs=32 ~1.8x speedup and GQA dominates KV bandwidth. Outstanding: capture in-venv patches as .patch files, open upstream PR.

…rry-pick Two new patch files for v0.19.0: 1. xpu_unified_memory_free_check_v19.patch (13 lines) Adds `elif current_platform.is_xpu()` branch to MemorySnapshot.measure() that uses psutil.virtual_memory().available instead of mem_get_info.free. Mirrors the existing Orin/Thor/Spark CUDA UMA branch. Fixes the pre-init free-memory check on Lunar Lake where page cache from reading safetensors makes mem_get_info.free underreport by ~5-10 GiB. 2. xpu_hybrid_moe_page_size_pr40128.patch (89 lines) Cherry-picks PR #40128's LCM-based padding fix onto v0.19.0's unify_kv_cache_spec_page_size(). Required to run Qwen3.5-35B-A3B AutoRound (hybrid attention+Mamba architecture) which trips issue #38979 otherwise. Both verified with Qwen3.5-35B AutoRound loading successfully and benching at 8.94 tok/s on Arc 140V / Lunar Lake.

Validates Sandermage's hybrid KV cache page-size fix on a different hybrid architecture (Qwen3.5-35B-A3B-AutoRound on Lunar Lake iGPU). Draft only — post from laptop when ready.

claude and others added 30 commits March 20, 2026 09:57

Add Lunar Lake Xe2 140V compatibility report

1ef4516

Documents incompatibility — this project targets discrete Arc Pro B60 GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Fix PyTorch XPU install: use correct index URL without +xpu suffix

9aedeff

The +xpu version suffix doesn't exist on PyPI. XPU wheels are served from a separate index URL. Also remove tail pipe so download progress is visible. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Pin PyTorch XPU to 2.10.0 with matching torchvision/torchaudio versions

efbf626

Use exact version pins from official Intel XPU install docs to avoid pulling incompatible versions. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Remove tail pipes so build progress is visible

a235798

All the | tail -N pipes were hiding pip install and build output, making the script appear frozen during long operations. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

Remove remaining tail pipes and retry stale xpu-kernels build

adf3638

Show full build output so errors are visible. Also clean up stale vllm-xpu-kernels directory if previous build failed. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

bench: add vllm benchmark results 2026-03-27 (13.6 tok/s, qwen3-8b-int4)

8840b82

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

bench: add concurrency 2 and 5 benchmark results (25.5 / 19.3 tok/s)

fcfb119

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

bench: add concurrency 5 capped test + final summary (36.4 tok/s, no …

e6262f3

…timeouts) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request #1 from MegaStood/claude/check-lunar-lake-compatib…

56ae5a5

…ility-CB5w6 Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results

bench: add Qwen3-ASR-1.7B benchmark results (81.9 tok/s at concurrenc…

505b3cf

…y 5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

nobara-user and others added 30 commits April 16, 2026 23:42

Merge remote-tracking branch 'origin/main' into claude/check-lunar-la…

6ecf709

…ke-compatibility-CB5w6 # Conflicts: # vllm/patches/vllm_for_multi_arc.patch

revert: MAX_JOBS back to 6 — 32GB swap absorbs overflow

e254ae4

32GB swap is configured, so peak 24GB RAM from 6 parallel SYCL compile units won't OOM, just spill to swap. Build stays fast. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34

install: quick-start example port 8080 → 8081 for gpt-oss

552fba8

GPT-OSS-20B launcher moved to 8081 so port 8080 can host Qwen3-Coder-30B AWQ as the default local model. openclaw config routes `local/qwen3-coder` (primary) to 8080 and `local-gptoss/gpt-oss-20b` (fallback) to 8081.

docs: draft comment for vllm-project/vllm PR #32899

c6b6447

LNL validation data for Intel's compressed-tensors W4A16 MoE draft PR: Qwen3-Coder-30B AWQ at 18.22 tok/s, plus group_size=32 vs 128 observation and LNL-specific install gotchas. Draft only — not posted yet.

docs: draft PR body for vllm-project/vllm xpu-integrated-gpu-mem branch

ea10f15

Opt-in VLLM_XPU_UNIFIED_MEMORY env var for Lunar Lake / Arrow Lake / Panther Lake iGPUs. Draft only — post from laptop when ready.

docs: draft comment for vllm-project/vllm PR #40128

ee69270

Validates Sandermage's hybrid KV cache page-size fix on a different hybrid architecture (Qwen3.5-35B-A3B-AutoRound on Lunar Lake iGPU). Draft only — post from laptop when ready.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334

docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334
MegaStood wants to merge 157 commits into
intel:mainfrom
MegaStood:claude/check-lunar-lake-compatibility-CB5w6

MegaStood commented Mar 25, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

MegaStood commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Files

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

MegaStood commented Mar 25, 2026 •

edited

Loading