Skip to content

docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334

Open
MegaStood wants to merge 157 commits into
intel:mainfrom
MegaStood:claude/check-lunar-lake-compatibility-CB5w6
Open

docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334
MegaStood wants to merge 157 commits into
intel:mainfrom
MegaStood:claude/check-lunar-lake-compatibility-CB5w6

Conversation

@MegaStood
Copy link
Copy Markdown

@MegaStood MegaStood commented Mar 25, 2026

Summary

  • 3-fix patch to enable MLA (Multi-head Latent Attention) for GLM-4.7-Flash on XPU: whitelist fix, TRITON_MLA routing, XPU flash_attn import
  • MLA reduces KV cache 17.5x (3.67 GiB → 0.21 GiB for 4096 tokens)
  • MoE marlin_shuffle_weight OOM investigation: 5 approaches tested, all blocked by 32GB shared memory limitation
  • Auto-fix script and unified patch file included

Files

  • issues/glm4-mla-xpu-bugs.md — 3-bug MLA issue writeup
  • issues/glm4_moe_lite_int4_xpu_marlin_shuffle.md — MoE OOM investigation
  • issues/vllm-30359-comment.md — upstream vLLM issue comment draft
  • scripts/fix_glm4_mla.sh — auto-fix script
  • vllm/patches/glm4_moe_lite_mla_xpu.patch — unified patch

claude and others added 30 commits March 20, 2026 09:57
Documents incompatibility — this project targets discrete Arc Pro B60
GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
SYCL ESIMD kernels are fully portable to Xe2 (no hardcoded device IDs
or BMG-specific constants). This adds the infrastructure adaptations:

- Dockerfile.lunar-lake: lightweight single-GPU image for iGPU
- lunar_lake_serve.sh: memory-aware vLLM launch with shared memory config
- lunar_lake_evaluation.sh: iGPU platform evaluation (skip P2P/CCL)
- platform_basic_evaluation.sh: detect Lunar Lake iGPU alongside B60
- LUNAR_LAKE_COMPATIBILITY.md: full docs with quick start and model recs

Key settings for Lunar Lake: TP=1, CCL_TOPO_P2P_ACCESS=0,
gpu-memory-utilization=0.7, enforce-eager, INT4 quantization recommended.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
One-command installer that handles:
- oneAPI Base Toolkit via DNF
- Level-Zero for Xe2 iGPU
- PyTorch XPU in a venv
- Patched vLLM build from source
- vllm-xpu-kernels + triton-xpu
- Bash aliases for quick launch (vllm-serve)

No Docker required. Targets Nobara 43 / Fedora 42+.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Level-Zero packages aren't in Nobara/Fedora default repos. Now:
- Adds Intel oneAPI + compute-runtime repos first
- Tries multiple Level-Zero package names (varies across distros)
- Uses --skip-unavailable to handle missing packages gracefully
- Falls back to checking if libze_loader exists from xe driver

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Two issues on Nobara 43:
1. oneAPI setvars.sh hangs when MPI probes network interfaces over SSH.
   Fix: set ONEAPI_SETVARS_MPI_INSTALL=0 before sourcing.
2. Nobara 43 ships Python 3.14 but PyTorch XPU only has wheels for <=3.12.
   Fix: detect version, auto-install python3.12 via dnf if needed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
setvars.sh has unbound variables and non-zero exits internally that
trigger our set -euo pipefail, silently killing the script at Phase 2.
Fix: temporarily set +euo pipefail around the source call, pipe output
through grep to show only component init lines, then restore strict mode.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The pipe operator causes `source` to run in a subshell, so all oneAPI
environment variables were lost in the parent shell. Redirect to a temp
file instead and grep it separately.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The +xpu version suffix doesn't exist on PyPI. XPU wheels are served
from a separate index URL. Also remove tail pipe so download progress
is visible.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Use exact version pins from official Intel XPU install docs to avoid
pulling incompatible versions.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
PyTorch XPU needs libze_intel_gpu.so (the GPU userspace driver) to
talk to the xe kernel driver via Level-Zero. This is provided by the
intel-compute-runtime package, which was missing from the install list.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
All the | tail -N pipes were hiding pip install and build output,
making the script appear frozen during long operations.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Running from ~/llm-scaler which has a vllm/ subdirectory causes
'import vllm' to succeed even when vLLM is not installed, skipping
the entire build phase. Use 'pip show vllm' instead.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Show full build output so errors are visible. Also clean up stale
vllm-xpu-kernels directory if previous build failed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Preload oneAPI MKL libraries (LD_PRELOAD) in both serve and install scripts
to fix "Cannot load libmkl_core.so.2" caused by PyTorch's broken relative
RPATH in venvs. Set MAX_JOBS=2 for xpu-kernels build to prevent OOM kills
on 32GB shared-memory systems.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Change MAX_JOBS from 2 to 6 (75% CPU, ~24GB peak RAM, safe for 32GB)
- Make MAX_JOBS overridable via environment variable
- Add log message warning that 933 SYCL files take 1.5-2 hours on Lunar Lake
- Add comment reminding users to plug in before building

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…unar Lake

Real-world testing on MSI Claw 8 AI+ (Arc 140V, 32GB shared) revealed:
- oneCCL KVS init fails on WiFi-only devices ("can't find non-loopback interface")
  Added CCL_ZE_ENABLE=0, CCL_ATL_TRANSPORT=ofi, FI_PROVIDER=tcp workaround
- --device xpu is not a valid vLLM CLI flag; device must be set via
  VLLM_TARGET_DEVICE=xpu environment variable
- xpu_worker.py all_reduce warmup must be patched out for single-GPU (TP=1)
- Pre-quantized AutoRound/GPTQ 35B models OOM during weight loading on 32GB
  shared memory (peak memory doubles due to INT4→FP16 unpacking)
- vllm-xpu-kernels build takes 1.5-2 hours on Lunar Lake (933 SYCL files)
- GPU crash (DEVICE_LOST) after OOM requires full system reboot

Updated: install_lunar_lake.sh, lunar_lake_serve.sh, Dockerfile.lunar-lake,
LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…g results

Add detailed model compatibility matrix based on real testing on MSI Claw 8
(Core Ultra 7 258V, Arc 140V). Key findings:
- Triton XPU backend broken on Xe2 (blocks Qwen3.5 fla/linear attention)
- Marlin kernels CUDA-only (blocks AWQ/GPTQ compressed-tensors)
- Pre-quantized INT4 doubles peak memory (blocks >14B models on 32GB)
- Only FP16 base + online quantization works reliably

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timeouts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tested Intel/Qwen3-8B-int4-AutoRound on MSI Claw 8 (32GB LPDDR5x):
- Single-request: 17.6 tok/s generation
- Batched peak: 90 tok/s (5 concurrent short requests)
- Long context (4K+2K): 50.2 tok/s output, 150.5 tok/s total
- Model loads in 7s using only 5.69 GiB

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6

Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results
…y 5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documented working setup for Qwen3-TTS-12Hz-1.7B-Base on Arc 140V:
- Venv creation with shared XPU PyTorch from vLLM install
- transformers==4.57.3 pinning (newer versions break qwen-tts)
- Voice cloning example with generate_voice_clone() API
- Tested and confirmed working on MSI Claw 8 AI+

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Add consolidated "Running Recipes" section with LLM/ASR/TTS commands and memory budget table
- Add Qwen3-ASR-1.7B setup, serve, and test documentation
- Use 127.0.0.1 (localhost) since OpenClaw accesses services locally
- ASR uses 0.25 GPU utilization (~7.2GB) for efficient memory usage

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Qwen3-30B-A3B GPTQ INT4 routes to IPEX (bypasses CUDA Marlin) but
OOMs during MoE expert weight shuffle at 15.7 GiB — GPU enters
DEVICE_LOST state. Confirms 30B MoE models don't fit on 32GB shared memory.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- New install script install_meteor_arrow_lake.sh supporting:
  - Meteor Lake Xe-LPG (Core Ultra 155H/135H): PCI 7d55, 7dd5, 7d40, 7d45
  - Arrow Lake Xe-LPG+ (Core Ultra 255H/245H): PCI 7d51, 7dd1, 7d41, 7d67
- Auto-detects platform and GPU via PCI device IDs
- Warns Meteor Lake users to switch from i915 to xe driver
- Adjusts memory recommendations based on system RAM (16-96GB)
- Updated evaluation script detect_gpu() for all three platforms
- Added platform compatibility table to LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… transformers 5.x catch-22

- Add server-side engine log observations: generation throughput decay with
  context length, prefix cache hit rates, KV cache usage per concurrent request
- Add Qwen3.5-4B AutoRound failure: multimodal model hits transformers 5.x
  Qwen2VLImageProcessor.max_pixels API break
- Add LFM2-24B-A2B AWQ failure: custom Liquid AI tokenizer unsupported
- Document transformers version catch-22: 4.x can't recognize new architectures,
  5.x breaks vLLM multimodal code
- Add critical blocker: transformers 5.x vs vLLM mismatch for multimodal models

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…rc.patch

Intel's llm-scaler Docker image uses transformers from git HEAD (5.x) with
vllm_for_multi_arc.patch applied, which adds full Qwen3.5 architecture support
(min_transformers_version="5.1.0"). The max_pixels AttributeError is caused by
transformers 5.x renaming image_processor.max_pixels to size["longest_edge"].
Added one-line getattr() fix for native installs without the full patch.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Confirmed that Qwen3.5-4B AutoRound INT4 successfully loads (3.68 GiB —
well within Lunar Lake's 28.6 GB budget) after applying the max_pixels
getattr() fix, but crashes during warmup on the Triton kernel in
fla/ops/layernorm_guard.py. Even Intel's forward_xpu code path for
Qwen3.5 routes through fla/ops layernorm which requires @triton.jit.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
nobara-user and others added 30 commits April 16, 2026 23:42
CUTLASS INT4 packs nibbles from different output columns in each byte,
not sequential nibbles along the input dimension like GPTQ.

Conversion needs full layout transform, not just int32→uint8 repack.
Inference pipeline works (no crash) but output is garbled due to this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full patch collection for running INT4 GPTQ/AutoRound MoE models
on Intel Lunar Lake Xe2 iGPU with vLLM 0.14 + vllm-xpu-kernels.

New patches:
- xpu_gptq_moe_cutlass_sequential.patch: CUTLASS sequential expert
  loop bypassing broken IPEX GatedMLPMOE (BREAKTHROUGH fix)
- eagle_ops_dtype_check.patch: skip ESIMD decode for non-FP16 query
- ipex_attention_dtype_cast.patch: cast k/v to match q dtype
- xpu_attention_direct_call.patch: bypass torch.ops.vllm attention
- xpu_moe_direct_call.patch: bypass torch.ops.vllm MoE dispatch
- xpu_worker_skip_profile_and_warmup.patch: skip profile + warmup

Updated patches:
- ipex_marlin_shuffle_skip_preshuffled.patch: full IPEX __init__ bypass
- mla_xpu_return_attn_probs.patch: regenerated from current source

Removed (already in source tree):
- autoround_fusedmoe_ipex_routing.patch (Bug A)
- gemma4_moe_top_k_experts.patch (Bug B)
- gptq_math_ceil_alignment.patch (Bug C)

Status: Inference pipeline completes without GPU crash. Output is
garbled due to GPTQ→CUTLASS INT4 weight format mismatch (column-
interleaved vs row-packed nibbles). Weight conversion fix is next.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ASS INT4 kernel

The _gptq_int32_to_uint8 conversion was missing a K<->N transpose step.
GPTQ stores qweight as [E, K/8, N] int32 (K-major), but CUTLASS grouped
GEMM expects [E, N, K/2] uint8 (N-major). The old implementation only
repacked nibbles along the existing K-major axis, producing [E, K/2, N]
which caused DEVICE_LOST or garbage output with real model weights while
appearing to work with zeros/random data.

New implementation: unpack int32 -> [E, K, N] int8, transpose to [E, N, K],
then repack nibble pairs (low = even-K, high = odd-K) -> [E, N, K/2] uint8.
Layout now matches vllm-xpu-kernels docstring in fused_moe_interface.py.

Documented symmetric-ZP constraint (assumes sym=True — AutoRound default).

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Adds a second backend for INT4 MoE selectable via VLLM_XPU_MOE_BACKEND:
  - "cutlass" (default): existing chunked grouped GEMM path
  - "onednn": per-expert int4_gemm_w4a16 via torch.ops._xpu_C

The oneDNN path uses GPTQ's native int32 [K/8, N] layout — no uint8
conversion, no shuffling, no implement_zp. Weights just need K-contiguous
strides (via transpose trick) and a scalar zero-point of 8 for symmetric
quantization. Slower than chunked CUTLASS (one kernel call per expert
per gemm) but useful as a fallback when CUTLASS has issues on specific
model shapes.

Both paths still bypass the broken IPEX GatedMLPMOE (Bug H / IPEX #838).

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… backend

- Explains the missing K<->N transpose that caused DEVICE_LOST with real
  model weights (worked with zeros/random due to symmetric data).
- Documents the new VLLM_XPU_MOE_BACKEND=onednn alternative path using
  int4_gemm_w4a16 with native GPTQ int32 layout as a fallback.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Correct weight conversion using int4_gemm_w4a16 (oneDNN format):
- No marlin_shuffle, no implement_zp, no uint8 conversion
- GPTQ int32 weights used directly with K-contiguous strides
- Scalar zero_point=8 for symmetric GPTQ
- 0.999997 correlation with CPU reference dequant

Inference produces real Chinese text (not garbled), 50 tokens
generated, HTTP 200 OK. Performance is 0.16 tok/s (sequential
expert loop — optimization needed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Optimized apply(): pre-compute active experts via .unique(),
use index_add_() for scatter-accumulate, fix dtype cast.

315s → 41s for 30 tokens. Generation: 0.16 → 0.9 tok/s.
Kernel benchmarks show 30μs per int4_gemm_w4a16 call — the
bottleneck is now Python loop overhead (88% of total time).

Updated xpu_gptq_moe_int4_w4a16.patch with optimized apply().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Confirmed: GPT-OSS-20B runs via GatedMLPMOE(is_mxfp4=True) on IPEX
marlin backend. Same GatedMLPMOE crashes with is_int4=True but
works with is_mxfp4=True — the bug is in IPEX's INT4 kernel path.

Benchmark: MXFP4 gets 3.0 tok/s (batched), INT4 gets 0.9 tok/s
(sequential). ~6.5x faster per layer due to grouped GEMM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both INT4 and MXFP4 CUTLASS grouped GEMM crash with DEVICE_LOST
at 16+ active expert groups. IPEX moe_gemm handles 32 experts fine.

The bug is in CUTLASS's Level Zero dispatch, not data format.
IPEX INT4 crashes for a separate reason (kernel bug in INT4 dequant).

GPT-OSS-20B benchmark: 5.4 tok/s gen, 14.2 tok/s prompt via IPEX.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…1 not real INT4

cutlass_grouped_gemm_interface(is_B_int4=True) treats nibbles as FP4
E2M1 microscaling format (2 exponent + 1 mantissa bit), NOT integer
4-bit. Value 15 maps to NaN, other values are exponentially wrong.
The entire CUTLASS INT4 path (implement_zp, uint8 repack, chunked
dispatch) produces garbage output with real GPTQ model weights.

int4_gemm_w4a16 (oneDNN) is the correct kernel — reads native GPTQ
int32 weights and does proper (nibble - zp) * scale dequantization.
30μs/call, 2x faster than CUTLASS, 40x faster than BF16 at M=1.

Changes:
- Default VLLM_XPU_MOE_BACKEND switched from "cutlass" to "onednn"
- CUTLASS INT4 path marked as broken in code comments
- Bug H docs rewritten with FP4 vs INT4 comparison table, correct
  kernel identification, and throughput analysis

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ke-compatibility-CB5w6

# Conflicts:
#	vllm/patches/vllm_for_multi_arc.patch
Bug H has two independent failures:
1. GatedMLPMOE crash (SOLVED) — topk_softmax + wrapper INT4 path,
   bypassed with torch.topk + direct IPEX ops
2. group_mm_int4_out_marlin returns zeros/NaN on Xe2 (UNSOLVED) —
   kernel broken, blocks batched path. 4-layout test proves no
   weight layout fix possible.

Correct old "Level Zero context pollution" theory — crash was in
GatedMLPMOE/topk_softmax, not attention corrupting driver state.

Patch cleanup:
- Delete xpu_gptq_moe_cpu_shuffle.patch (superseded by int4_w4a16)
- Delete xpu_gptq_moe_cutlass_sequential.patch (CUTLASS is FP4 E2M1)
- Delete xpu_skip_warmup_dummy_run.patch (subset of superset patch)
- Delete vllm_xpu_worker_skip_profile.patch (subset of superset patch)
- Add Bug J (eagle_ops dtype) and Bug K (direct-call workarounds)
- Fix Bugs A/B/C patch refs (removed — fixed upstream)
- Update status table with all 11 bugs (A-K)

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Critical fixes:
- transformers>=5.5.0: graceful fallback to 4.x if 5.x not on PyPI
- triton: use --force-reinstall instead of uninstall+install to avoid
  leaving venv without any triton on network failure
- MAX_JOBS: default 4 (was 6) to avoid OOM on 32GB Lunar Lake
- triton-xpu: unpin version, let pip resolve torch 2.10 compatibility

Significant fixes:
- Add root guard (EUID check) — sudo ./install would put venv in /root/
- Add GPU access preflight (/dev/dri/renderD128 permission check)
- Python version: check >=3.10 AND <=3.12 (was only <=3.12)
- sed xpu_worker patch: replace fragile range pattern with Python regex
- MKL LD_PRELOAD: add to vllm-v19-activate alias so it persists
- torch XPU check: use stable torch.xpu.is_available() API
- Document RHEL repo usage on Fedora with verification note

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
32GB swap is configured, so peak 24GB RAM from 6 parallel SYCL compile
units won't OOM, just spill to swap. Build stays fast.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Three alignments with vLLM 0.19.0's own requirements/xpu.txt and docs:

1. torch: install torch==2.10.0+xpu via --extra-index-url (matches the
   exact pin in requirements/xpu.txt). The previous torch==2.10.0 +
   --index-url worked but is not the official pinning.

2. vllm-xpu-kernels: Phase 4's `pip install -r requirements/xpu.txt`
   already installs the pre-built wheel vllm_xpu_kernels==0.1.4 (pinned
   in upstream). The 1.5-2 hour source build of v0.1.5 in Phase 5 is
   now OPT-IN via VLLM_BUILD_XPU_KERNELS=1. Default install is ~30 min
   instead of ~2 hours.

3. triton-xpu: pin to 3.6.0 (required for torch 2.10 per v0.19 docs) and
   switch from --extra-index-url=.../whl/test/xpu (rolling test index)
   to --extra-index-url=.../whl/xpu (stable). test/ may ship 3.7.x which
   doesn't pair correctly with torch 2.10.

Refs:
- https://github.com/vllm-project/vllm/blob/v0.19.0/requirements/xpu.txt
- https://github.com/vllm-project/vllm/blob/v0.19.0/docs/getting_started/installation/gpu.xpu.inc.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…unar Lake findings

Installer fix:
- vLLM 0.19.0's requirements/xpu.txt pins vllm_xpu_kernels==0.1.4 which
  does NOT recognise Lunar Lake as XE2 (is_xe2_arch() only matches bmg_g21,
  bmg_g31, and pvc; intel_gpu_lnl_m is missing). First inference request
  fails with "Only XE2 cutlass kernel is supported currently." and kills
  the engine.
- v0.1.5 adds intel_gpu_lnl_m to the allowlist. Pre-built wheel available
  on GitHub releases — no source build needed.
- Phase 4 now force-reinstalls v0.1.5 right after requirements/xpu.txt.

Quick-start example updated:
- Shows --kv-cache-memory-bytes workaround (v0.19 replacement for our
  v0.14 VLLM_SKIP_PROFILE_RUN=1 patch — v0.19's MXFP4 profile-run peak
  busts the 28 GiB Lunar Lake budget even at max_num_seqs=4).
- Adds drop_caches / reboot hints for the XPU shared-memory leak that
  accumulates across crashed launches.

Findings doc (issues/vllm-v19-lunar-lake-findings.md) captures:
- Exact v0.1.4 vs v0.1.5 diff in csrc/utils.h::is_xe2_arch
- Why --kv-cache-memory-bytes is needed on Lunar Lake
- XPU shared-memory leak symptom and workaround
- What's carried forward from v0.14 vs what's new in v0.19

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
End-to-end verification after v0.1.5 wheel fix (post-reboot):
- GPT-OSS-20B MXFP4 serving on v0.19 works.
- vllm bench serve (single-user, --ignore-eos):
    128 in/out:   19.1 tok/s avg, 43 ms TPOT
    1024 in/out:  15.2 tok/s avg, 67 ms TPOT
- Steady-state decode ≈ 23 tok/s — on par with v0.14 IPEX
  GatedMLPMOE(is_mxfp4=True) baseline of ~22 tok/s. MXFP4 gets no
  meaningful speedup on v0.19; real payoff is expected for INT4
  AutoRound MoE (not tested yet).

Added openclaw integration config (matches local/gpt-oss-20b provider
on 127.0.0.1:8080, uses --served-model-name so short ids resolve).

Expanded Finding 3 (XPU shared-memory leak):
- Not confined to SIGKILL — Ctrl+C on graceful vllm serve also leaks.
- Observed pattern of degrading XPU free across the session.
- `sync && echo 3 > drop_caches` usually recovers enough to relaunch;
  below ~15 GiB free after cache drop indicates a real L0 mapping
  leak and only reboot resets it.

Memory budget math at 64k context documented (48 KB/token/layer).
KV-cache-memory-bytes config for single-seq vs two-seq worked out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kernel

Tonight's finding: vllm_xpu_kernels v0.1.5 has a real, correct INT4 MoE
kernel (cutlass_grouped_gemm_interface with is_B_int4=True). Verified
numerically against a CPU fp32 dequant+matmul reference using Qwen3-VL-
30B-A3B's layer 0 expert 0 gate_proj weights:

  rel error:   0.003      (bf16 noise floor)
  correlation: 1.000
  range:       [-3.27, 3.09]  vs ref [-3.26, 3.10]

v0.1.4 had `is_B_int4` actually routing to FP4 E2M1 (documented bug in
earlier findings). v0.1.5 separates the two via scale dtype: uint8 B with
bf16/fp16 scales = real INT4; uint8 B with uint8 E8M0 scales = MXFP4.

vLLM 0.19 itself doesn't wire this up — the default GPTQ MoE path calls
CUDA-only torch.ops._C.gptq_shuffle, the AutoRound path raises explicit
NotImplementedError during 'xpu kernel migration', and AWQ also hits the
missing gptq_marlin_repack kernel. RFC #33214 confirms int4_moe_support
has no PR yet.

Patch adds an XPU branch to GPTQMarlinMoEMethod that:
  - process_weights_after_loading: GPTQ int32-packed [E, K/8, 2N]
    -> uint8 2's-complement [E, 2N, K/2]; scales transposed to
    [E, 2N, K/GS]. All done per-expert on CPU to avoid XPU allocator
    pressure, original weights freed before new ones materialize.
  - apply: skips upstream CUDA kernel, uses topk -> moe_rows_counts ->
    moe_scatter -> cutlass_grouped_gemm(is_B_int4=True) x2 with silu_and_mul
    -> moe_gather pipeline.

Still to do before this can ship:
  - Apply patch to a live v0.19 venv and test Qwen3-30B-a3b-gptq-int4
    end-to-end
  - Verify the xpu_int4_ready flag is reached before the modular XPUExperts
    path takes over
  - Handle GPTQ asym mode (desc_act, non-default qzeros)

The isolated-kernel test is the crucial bit. Wiring is the cheap part —
our v0.14 patch a7ee41d followed the same shape against a broken kernel
(v0.1.4 FP4) and got garbled output; against the now-verified v0.1.5
kernel it should finally produce coherent output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First end-to-end success for a non-MXFP4 MoE model on vLLM 0.19 + Intel
Arc 140V. Verified with Qwen3-30B-a3b-gptq-int4:

  weights load:  15.56 GiB in 40.1s
  decode:        100 tokens in 8.67s = 11.5 tok/s
  output:        coherent English, no garbling

Prior state on v0.14 + our patches (Python sequential expert loop via
int4_gemm_w4a16): 0.9 tok/s. This is a ~12.8x speedup — the batched
CUTLASS int4 kernel we verified earlier (v0.1.5, rel=0.003, corr=1.000)
now delivers in practice.

Two patches combined:

1. xpu_gptq_awq_linear_int4_pr33662.patch
   Upstream PR #33662 ("[XPU][3/N] add int4 gemm support for xpu").
   Covers LINEAR gptq/awq layers (q/k/v/o_proj, lm_head, etc.) which
   previously tripped on CUDA-only torch.ops._C.gptq_shuffle. Uses
   vllm_xpu_kernels v0.1.5's GPTQUtils / transpose_onednn_woq_format
   helpers + int4_gemm_w4a16 dispatch.

2. xpu_gptq_moe_int4_cutlass_v19.patch (updated from draft, now a
   clean diff against v0.19.0 upstream)
   Our MoE half — adds XPU branch to GPTQMarlinMoEMethod:
     process_weights_after_loading: int32-packed GPTQ -> uint8
       2's-comp [E, 2N, K/2], scales [E, 2N, K/GS]
     apply: topk -> moe_rows_counts -> moe_scatter ->
       cutlass_grouped_gemm(is_B_int4=True) x2 + silu_and_mul ->
       moe_gather

PR #33662 alone cannot load a GPTQ MoE model (linear layers prep works
but MoE hits gptq_shuffle). Our MoE patch alone can't boot a model
whose linear layers also use GPTQ (same trip point at load). Both
required.

Remaining work:
- AutoRound routing (Bug-A-style) to send XPU FusedMoE to
  GPTQMarlinMoEMethod, bypassing 'INC not supported during xpu
  kernel migration'. Unlocks Qwen3-VL-30B, GLM-4.7-Flash, Qwen3.5.
- AWQ MoE (analog of our GPTQ MoE patch).
- GPTQ asym (desc_act) qzeros handling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…of INC stub

AutoRound's auto_round:auto_gptq and auto_round:auto_awq packing formats
are bit-compatible with plain GPTQ and AWQ. INCConfig.apply_gptq_quant_layer
already constructs GPTQMarlinMoEMethod / GPTQMarlinLinearMethod for these
models — the only bug is get_quant_method short-circuiting XPU to
apply_ipex_quant_layer which raises NotImplementedError during the XPU
kernel migration.

Patch adds a 10-line XPU branch in get_quant_method: route AutoRound to
apply_gptq_quant_layer or apply_awq_quant_layer directly, before the
legacy IPEX dispatch. Both destinations have XPU-aware apply() now:
  - Linear layers: via PR #33662 (int4_gemm_w4a16)
  - GPTQ MoE: via our xpu_gptq_moe_int4_cutlass_v19.patch
             (cutlass_grouped_gemm_interface with is_B_int4=True)

End-to-end test with Qwen3-VL-30B-A3B blocked by XPU shared-memory leak
from the successful Qwen3-30B GPTQ run (preflight reported 0.75 GiB free
— needs reboot to clear). Syntax verified; routing logic is a direct
mirror of v0.14's autoround_fusedmoe_ipex_routing.patch (Bug A) which
went through the same path.

Three patches now compose the XPU INT4 MoE stack on v0.19:
  1. xpu_gptq_awq_linear_int4_pr33662.patch      (upstream PR #33662)
  2. xpu_gptq_moe_int4_cutlass_v19.patch          (our MoE add)
  3. xpu_autoround_route_to_gptq_awq_v19.patch    (this — routing shim)

AWQ MoE (qwen3-coder-30b-a3b-awq, glm-4.7-flash-awq-4bit) still not
covered — would need a fourth patch mirroring our GPTQ MoE logic against
AwqMoEMethod.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First-ever coherent output for this model on Lunar Lake. Prior v0.14
stack produced silent zero-output (garbled tokens) because the IPEX
int4 kernel was FP4-aliased. v0.19 + four-patch stack uses the real
CUTLASS int4 path (verified rel=0.003, corr=1.000 against CPU fp32 ref)
and gets both correctness AND a 10x speedup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EX ops

Prior MoE patch called torch.ops.torch_ipex.moe_rows_counts / moe_scatter /
moe_gather / silu_and_mul — those live in intel_extension_for_pytorch, which
is archived and not installed in v0.19. The earlier 11.5 tok/s Qwen3-30B GPTQ
claim was an artifact of v0.14 IPEX residual in the test environment.

In v0.19, vllm_xpu_kernels ships a complete xpu_fused_moe() entry point
(fused_moe_interface.py) that handles routing + grouped gemm + activation +
gather, using _moe_C / _xpu_C / _C namespaces. It also applies the zp=8
transform internally via implement_zp(), so weights are packed as raw u4
nibbles (no -8 subtract at pack time).

Also add compressed_tensors (AWQ/pack-quantized) MoE patch that mirrors the
GPTQ path. Verified on Qwen3-Coder-30B-A3B-Instruct-AWQ-4bit on Lunar Lake:
coherent output at ~22 tok/s decode.
Bench recipe and results for 1024-in / 1024-out via `vllm bench serve`.
Also documents the IPEX-ops bug in the earlier MoE patch (archived IPEX
ops not installed in v0.19), the fix to use vllm_xpu_kernels
xpu_fused_moe() directly, and flags prior 11.5/9.3 tok/s claims as
pending re-verification against commit f47f4a2.
GPT-OSS-20B launcher moved to 8081 so port 8080 can host
Qwen3-Coder-30B AWQ as the default local model. openclaw config
routes `local/qwen3-coder` (primary) to 8080 and
`local-gptoss/gpt-oss-20b` (fallback) to 8081.
Same bench recipe (1024/1024/3) across all three Qwen3-30B-A3B INT4
variants using the fixed xpu_fused_moe() path:

- qwen3-coder-awq (group_size=32):      18.22 tok/s
- qwen3-30b-gptq (group_size=128):       9.97 tok/s
- qwen3-vl-autoround (group_size=128):   9.68 tok/s

Same architecture, same kernel, same patches. group_size=32 path is
~1.83x faster than group_size=128 in vllm_xpu_kernels v0.1.5. GPTQ and
AutoRound land within 3% of each other, confirming AutoRound->GPTQ XPU
routing exposes the same code path.
LNL validation data for Intel's compressed-tensors W4A16 MoE draft PR:
Qwen3-Coder-30B AWQ at 18.22 tok/s, plus group_size=32 vs 128 observation
and LNL-specific install gotchas. Draft only — not posted yet.
Opt-in VLLM_XPU_UNIFIED_MEMORY env var for Lunar Lake / Arrow Lake /
Panther Lake iGPUs. Draft only — post from laptop when ready.
…B ceiling

New findings:
- mem_get_info.free underreports on unified memory; patched mem_utils.py
  to use psutil.virtual_memory().available on XPU (mirrors existing
  CUDA UMA branch). Upstream PR branch at MegaStood/vllm:xpu-integrated-gpu-mem
  uses opt-in VLLM_XPU_UNIFIED_MEMORY env var (safer than memory-ratio
  heuristic for PVC-class false positives).

- v0.19.0 regression: unify_kv_cache_spec_page_size() hard-fails on
  hybrid attention+Mamba models (issue #38979, not in v0.19.1). Cherry-
  picked PR #40128's LCM-based padding fix onto kv_cache_utils.py.

- Hard ceiling on Lunar Lake: ~21 GiB model weights on 30.9 GiB system.
  AutoRound fits (Qwen3.5-35B at 21 GiB = 8.94 tok/s); GPTQ at 23 GiB
  does not fit by physics (util must be 0.91, free check fails).

- Added GLM-4.7-Flash (AWQ 6.58 / AR 4.71 tok/s) and Qwen3.5-35B AR
  (8.94 tok/s) to the benchmark matrix. Confirms gs=32 ~1.8x speedup
  and GQA dominates KV bandwidth.

Outstanding: capture in-venv patches as .patch files, open upstream PR.
…rry-pick

Two new patch files for v0.19.0:

1. xpu_unified_memory_free_check_v19.patch (13 lines)
   Adds `elif current_platform.is_xpu()` branch to
   MemorySnapshot.measure() that uses psutil.virtual_memory().available
   instead of mem_get_info.free. Mirrors the existing Orin/Thor/Spark
   CUDA UMA branch. Fixes the pre-init free-memory check on Lunar Lake
   where page cache from reading safetensors makes mem_get_info.free
   underreport by ~5-10 GiB.

2. xpu_hybrid_moe_page_size_pr40128.patch (89 lines)
   Cherry-picks PR #40128's LCM-based padding fix onto v0.19.0's
   unify_kv_cache_spec_page_size(). Required to run Qwen3.5-35B-A3B
   AutoRound (hybrid attention+Mamba architecture) which trips issue
   #38979 otherwise.

Both verified with Qwen3.5-35B AutoRound loading successfully and
benching at 8.94 tok/s on Arc 140V / Lunar Lake.
Validates Sandermage's hybrid KV cache page-size fix on a different
hybrid architecture (Qwen3.5-35B-A3B-AutoRound on Lunar Lake iGPU).
Draft only — post from laptop when ready.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants