[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1, f32 c (1.7x over mac) by erwei-xilinx · Pull Request #1639 · Xilinx/mlir-air

erwei-xilinx · 2026-06-01T05:46:00Z

Summary

Adds an int4-AWQ GEMM (matmul_int4_bf16_packed_f32) alongside the existing matvec entries in mv_int4_bf16.cc, plus a programming_examples/matrix_multiplication/int4_awq/ host builder + Makefile + lit tests.

The kernel:

Dequant W (NO scale fold) into mmul-packed bf16 b_pack; pack A into mmul-packed bf16 a_pack.
Per-group MMUL via aie::mmul<8,8,8,bf16,bf16,accfloat> produces an f32 tile per group.
Post-MMUL scale fold: convert MMUL accum to bf16, multiply by bf16 scale broadcast (mul → f32 accum, no extra truncate), accumulate into the persistent f32 c. One bf16 truncate per output element per group instead of per W element — matches the mac-kernel precision pattern.
f32 c kept across host K-chunk calls (bf16 GEMM accumulator pattern). After the K loop a one-shot f32_to_bf16_mn converts the L1 C tile to bf16 for the drain. C never round-trips through bf16 mid-accumulation.
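The fold order in the steps above can be sanity-checked on the host with a small pure-Python model. This is only a sketch: the `S[g][n]` / `Z[g][n]` per-(group, column) layout, the group size `gs`, and both function names are assumptions made for illustration, not the kernel's packed Q+S+Z format. It uses exact fractions, so it shows that the two orderings agree algebraically and does not model bf16 rounding.

```python
from fractions import Fraction

def gemm_scale_fold_pre(A, Q, S, Z, gs):
    """Reference: fold the scale into every dequantized weight, then multiply."""
    M, K, N = len(A), len(Q), len(Q[0])
    C = [[Fraction(0)] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                g = k // gs
                w = S[g][n] * (Q[k][n] - Z[g][n])  # scale applied per W element
                C[m][n] += A[m][k] * w
    return C

def gemm_scale_fold_post(A, Q, S, Z, gs):
    """Post-MMUL fold: unscaled per-group partial product, one scale multiply per group."""
    M, K, N = len(A), len(Q), len(Q[0])
    C = [[Fraction(0)] * N for _ in range(M)]  # stands in for the persistent f32 c
    for g in range(K // gs):
        for m in range(M):
            for n in range(N):
                partial = sum(A[m][k] * (Q[k][n] - Z[g][n])
                              for k in range(g * gs, (g + 1) * gs))
                C[m][n] += S[g][n] * partial  # scale applied once per (m, n, g)
    return C

# Small deterministic inputs (illustrative values only).
M, K, N, gs = 2, 8, 3, 4
A = [[Fraction(m + k, 4) for k in range(K)] for m in range(M)]
Q = [[(3 * k + 5 * n) % 16 for n in range(N)] for k in range(K)]
S = [[Fraction(1, 2 ** (g + n + 1)) for n in range(N)] for g in range(K // gs)]
Z = [[8 for _ in range(N)] for _ in range(K // gs)]
assert gemm_scale_fold_pre(A, Q, S, Z, gs) == gemm_scale_fold_post(A, Q, S, Z, gs)
```

On hardware the two orderings differ only in where the bf16 truncation happens, which is the point of moving the fold after the MMUL.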

Same packed Q+S+Z BO layout as the GEMV — one mv_int4_bf16.o serves both decode (GEMV) and prefill (GEMM).

Performance (NPU2, M=N=K=2048, herd 8x4)

Kernel	Per-iter	Throughput	Note
(alternative) int4 via per-element `aie::mac`	198 ms	87 GOPS	rejected; 4x off bf16 peak
int4 via `aie::mmul` + post-MMUL fold + f32 c (this PR)	117 ms	147 GOPS	1.7x faster than mac, 44% of bf16 ceiling
bf16 GEMM via `aie::mmul` (ceiling reference)	51 ms	337 GOPS	dense bf16, no quantization

Remaining gap to the bf16 ceiling: the dequant scatter store (R K-values for fixed N scatter across R/s k-blocks at stride t in the [s][t] inner layout). A vectorized scatter — processing multiple N's in parallel per k-block to enable contiguous stores — could close most of the remaining 2.3x; deferred until a prefill-bound workload demands it.
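The throughput column and the ratios quoted above follow from the per-iter times, assuming the usual convention of 2*M*N*K operations per GEMM (one multiply and one add per MAC). A minimal sketch with a hypothetical helper name:

```python
def gemm_gops(m, n, k, seconds):
    """Throughput in GOPS, counting one multiply and one add per MAC."""
    return 2 * m * n * k / seconds / 1e9

# Per-iter times from the table above, in milliseconds.
timings_ms = {"mac": 198.0, "mmul": 117.0, "bf16": 51.0}
gops = {name: gemm_gops(2048, 2048, 2048, ms / 1e3)
        for name, ms in timings_ms.items()}

for name, value in gops.items():
    print(f"{name}: {value:.0f} GOPS")
print(f"mmul vs mac speedup: {timings_ms['mac'] / timings_ms['mmul']:.2f}x")
print(f"mmul / bf16 ceiling: {gops['mmul'] / gops['bf16']:.0%}")
print(f"remaining gap to bf16: {timings_ms['mmul'] / timings_ms['bf16']:.2f}x")
```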

Host builder design

A 2D herd over (M, N). Each PE owns a (tile_m, tile_n) C-tile and serially accumulates K into a per-PE f32 L1 C accumulator, plus a separate bf16 L1 C drain buffer that the post-K-loop convert kernel writes into. The launch space tiles (M, N) across launches; the K-L2 loop is collapsed to 1 iter (TILE_K_L2 = K) so the f32 accumulator survives across all K_CHUNK steps. Drain DMAs the bf16 buffer into a 4D L2 C [herd_m, herd_n, tile_m, tile_n] using per-PE dst_offsets=[_tx, _ty, 0, 0].
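The index arithmetic implied by this design can be sketched as follows. The helper names are hypothetical, the 4D buffer is assumed to be row-major, and `_tx` / `_ty` are assumed to index the `herd_m` / `herd_n` dimensions respectively (as the `dst_offsets=[_tx, _ty, 0, 0]` ordering suggests). The tile sizes in the test values are example inputs, not values stated in this PR.

```python
def launch_sizes(M, N, tile_m, tile_n, herd_m, herd_n):
    """Launch grid: each launch covers a (tile_m*herd_m, tile_n*herd_n) block of C."""
    assert M % (tile_m * herd_m) == 0 and N % (tile_n * herd_n) == 0
    return [M // (tile_m * herd_m), N // (tile_n * herd_n)]

def l2_c_offset(tx, ty, i, j, herd_n, tile_m, tile_n):
    """Flat offset of element (i, j) of PE (tx, ty)'s tile in a
    row-major [herd_m, herd_n, tile_m, tile_n] L2 C buffer."""
    return ((tx * herd_n + ty) * tile_m + i) * tile_n + j

def global_c_index(lm, ln, tx, ty, i, j, tile_m, tile_n, herd_m, herd_n):
    """(row, col) in the full C for launch (lm, ln), PE (tx, ty), element (i, j)."""
    row = (lm * herd_m + tx) * tile_m + i
    col = (ln * herd_n + ty) * tile_n + j
    return row, col
```

Because each PE writes a disjoint `[tile_m, tile_n]` slab selected only by its own `(_tx, _ty)`, no two PEs share a destination offset in the L2 buffer.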

This replaces an earlier 1D-herd-over-N design that put M on the launch axis and hit a correctness cliff at M_div >= 4: compiler-side air-opt-shim-dma-bds unrolls non-zero-stride launch-axis iterations into per-iter shim BDs, which is fine at M_div=2 but gives silently wrong results at 4-5, a HW hang at 6-8, and BD-pool exhaustion at >=16. The 2D-herd structure sidesteps that path entirely.

Files

File	What
`mv_int4_bf16.cc`	Adds `mm_int4_bf16_mmul_impl<float>`, `matmul_int4_bf16_packed_f32`, `zero_vectorized_bf16_mn`, `zero_vectorized_f32_mn`, `f32_to_bf16_mn`. New `DIM_K_CHUNK` macro (default 128) decouples the matmul's per-call k_chunk from the matvec's full `DIM_K`. Existing GEMV symbols and ABI unchanged.
`matmul_int4_packed.py`	Host builder. `@launch(sizes=[M/(tile_m*herd_m), N/(tile_n*herd_n)])` + `@segment` (L2 staging) + `@herd(sizes=[herd_m, herd_n])` with per-PE f32 L1 C accumulator + bf16 L1 C drain buffer + convert kernel call between the K loop and the drain.
`Makefile`	Builds the shared `mv_int4_bf16.o` (passes `DIM_K_CHUNK=$(TILE_K_L1)`). Targets: `run_packed`, `run1x1`, `run2x4`, `run8x4`, `run_llama_qproj`.
`run_packed_npu2_small_peano.lit`	Smoke: M=64 K=128 N=128, herd 2x4.
`run_packed_npu2_llama_qproj_seq32_peano.lit`	Llama-3.2-1B Q-projection at prefill seq=32: M=32 K=2048 N=2048, herd 2x4.

Notable: `zero_vectorized_bf16_mn` is not just a size variant

Peano on AIE2P auto-vectorizes a scalar `for c[i]=0` loop with a stride-4 store that silently skips every 4th element once the buffer is at least one full vector wide (32 bf16). The existing GEMV's zero_impl<DIM_M=8> stays under that threshold so it was unaffected, but the larger GEMM C tile (e.g. 16x16 = 256 bf16) triggers the bug. Fix: an explicit aie::store_v(aie::zeros<bfloat16, 32>()). The existing zero_impl<m> is left untouched. The new zero_vectorized_f32_mn follows the same explicit-vector-store pattern.
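A sentinel pre-fill is one way to make this class of bug visible from a test. The sketch below is a host-side model only: `buggy_zero` imitates the reported symptom (which lane gets skipped is an assumption, and the 32-element threshold is not modeled), and `vector_zero` imitates whole-vector stores. Neither function is part of the kernel.

```python
SENTINEL = 1.0

def buggy_zero(buf, stride=4):
    """Model of the reported miscompile: every `stride`-th element is left untouched.
    Which lane is skipped is an assumption made for illustration."""
    for i in range(len(buf)):
        if i % stride != stride - 1:
            buf[i] = 0.0

def vector_zero(buf, lanes=32):
    """Model of the explicit fix: whole `lanes`-wide stores of zeros."""
    assert len(buf) % lanes == 0
    for base in range(0, len(buf), lanes):
        buf[base:base + lanes] = [0.0] * lanes

def stale_indices(zero_fn, n):
    """Pre-fill with a sentinel so untouched elements are visible afterwards."""
    buf = [SENTINEL] * n
    zero_fn(buf)
    return [i for i, v in enumerate(buf) if v != 0.0]
```

For a 16x16 tile (256 elements), `stale_indices(vector_zero, 256)` is empty, while the buggy model leaves a quarter of the elements holding the sentinel.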

Validation (NPU2)

Shape	Herd	Correlation	Mismatches (atol=0.05)	Result
M=16 K=128 N=16 (1x1 smoke)	1x1	0.999967	0 / 256	PASS
M=64 K=128 N=128	2x4	0.999974	0 / 8192	PASS
M=256 K=128 N=128	8x4	0.999973	0 / 32768	PASS
Llama Q-proj seq=32 (M=32 K=N=2048)	2x4	0.999974	11 / 65536 (0.017%)	PASS
Llama Q-proj prefill (M=N=K=2048)	8x4	0.999975	small	PASS
GEMV regression (existing `matvec_int4_packed` lit)	—	0.999997 unchanged	0	PASS

max_mismatch_percentage=0.05 in the host script bounds the small bf16-noise-floor mismatches at large K to ≤0.05% of output elements; min_correlation=0.999 remains the primary correctness check.
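The acceptance rule described above can be sketched in plain Python. `min_correlation`, `max_mismatch_percentage` and `atol` are the names and values quoted in this PR; the `check_output` function itself, the Pearson-correlation choice, and the absence of an `rtol` term are assumptions, not the host script's actual code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def check_output(actual, expected, atol=0.05, min_correlation=0.999,
                 max_mismatch_percentage=0.05):
    """Return (ok, correlation, mismatch_count, mismatch_percent)."""
    mismatches = sum(abs(a - e) > atol for a, e in zip(actual, expected))
    mismatch_pct = 100.0 * mismatches / len(expected)
    corr = pearson(actual, expected)
    ok = corr >= min_correlation and mismatch_pct <= max_mismatch_percentage
    return ok, corr, mismatches, mismatch_pct
```

With 65536 output elements, a 0.05% bound admits at most 32 mismatching elements, so the 11 mismatches reported for the seq=32 Q-proj shape pass with margin.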

Test plan

lit invocation of run_packed_npu2_small_peano.lit — PASS
lit invocation of run_packed_npu2_llama_qproj_seq32_peano.lit — PASS
Llama Q-proj at full prefill scale (M=N=K=2048, herd 8x4) — PASS
Existing matrix_vector_multiplication/int4_awq/run_packed_* lits still PASS (shared kernel; GEMV symbols/ABI unchanged)
Stitch this GEMM into an int4 prefill ELF for llama32_1b (follow-up)

Known limitations

TILE_K_L2 = K required — the f32 L1 C accumulator lives only for one herd invocation and the segment-level K-L2 loop would re-zero it. For K=2048 herd 8x4, L2 A = 8 * 16 * 2048 * 2 = 512 KB — fits one aie2p memtile. To scale K beyond a single memtile's capacity, the K-L2 loop will need to move inside the herd (channels-based feed), or the multi-herd zero+compute+drain pattern needs to work with func.call (currently the air-dependency pass doesn't track CallOp writes through memref.subview, so a herd-wide L1 C with per-PE subviews can't be used today).
Dequant scatter store is the perf floor: this PR closes 1.7x of the 4x gap to bf16 MMUL. A vectorized scatter (parallel-N dequant for contiguous stores per k-block) could push closer to the ceiling; deferred until prefill is the bottleneck.
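The L2 A footprint quoted in the first limitation can be reproduced with a one-line helper. Reading the four factors as herd_m, tile_m, K and bytes per bf16 element is an interpretation of the expression above, and the helper name is hypothetical.

```python
def l2_a_bytes(herd_m, tile_m, k, bytes_per_elem=2):
    """Bytes of A staged in L2 when the whole K extent is resident (TILE_K_L2 = K)."""
    return herd_m * tile_m * k * bytes_per_elem

size = l2_a_bytes(herd_m=8, tile_m=16, k=2048)
print(size, "bytes =", size // 1024, "KiB")
```

The footprint grows linearly with K, which is why a larger K needs the K-L2 loop moved inside the herd or a working multi-herd pattern.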

Known follow-ups

Int4 prefill ELFs for llama32_1b stitching this GEMM (Q/K/V/O projections, FFN gate/up/down) into multi-launch ELFs analogous to the existing bf16 prefill ELFs.

Adds matmul_int4_bf16_packed alongside the existing matvec entries in mv_int4_bf16.cc, plus a programming_examples/matrix_multiplication/int4_awq host builder, Makefile and lit tests. The GEMM reuses the GEMV's Q+S+Z per-tile packed BO layout (output-major), extended to m_tile activation rows; one .o file serves both decode (GEMV) and prefill (GEMM).

Also adds zero_vectorized_bf16_mn: an explicit aie::store_v of aie::zeros for the larger GEMM C tile. Peano auto-vectorizes a scalar `for c[i]=0` loop on AIE2P with a stride-4 store that skips every 4th element once the buffer is >= one full vector wide; this manifests as repeated kernel calls reading stale c[]. The existing GEMV's DIM_M=8 stays under the vectorization threshold so it was unaffected.

Tested on NPU2:
- Smoke (M=32 K=128 N=64, exercises M_div>1 + N_div>1): corr 0.999997
- Llama Q-proj at prefill seq=32 (M=32 K=2048 N=2048): corr 0.999985
- GEMV regression (M=2048 K=2048): unchanged, corr 0.999997

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds an int4-AWQ packed GEMM (matmul) path alongside the existing packed GEMV micro-kernels, plus a standalone matrix-multiplication programming example (builder + Makefile + NPU2 lit coverage) that reuses the same compiled mv_int4_bf16.o kernel object.

Changes:

Extend mv_int4_bf16.cc with matmul_int4_bf16_packed, a GEMM inner kernel (mm_int4_bf16_impl), and a GEMM-safe vectorized zero helper (zero_vectorized_bf16_mn).
Add a new programming_examples/matrix_multiplication/int4_awq/ example with a host builder (matmul_int4_packed.py) and Makefile that compiles the shared kernel from the matvec directory.
Add two NPU2+Peano lit smoke/correctness tests for the new GEMM example and register it in the programming examples dashboard generator.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
programming_examples/matrix_vector_multiplication/int4_awq/mv_int4_bf16.cc	Adds packed int4-AWQ GEMM kernel entry point and GEMM-oriented zeroing helper alongside existing GEMV micro-kernels.
programming_examples/matrix_multiplication/int4_awq/matmul_int4_packed.py	Introduces a standalone AIR/XRT builder for packed int4-AWQ GEMM using the shared kernel object.
programming_examples/matrix_multiplication/int4_awq/Makefile	Builds/runs the GEMM example while reusing `mv_int4_bf16.cc` from the matvec example directory.
programming_examples/matrix_multiplication/int4_awq/run_packed_npu2_small_peano.lit	Adds a small NPU2 smoke test for the packed GEMM flow.
programming_examples/matrix_multiplication/int4_awq/run_packed_npu2_llama_qproj_seq32_peano.lit	Adds an NPU2 correctness test based on Llama Q-proj prefill sizing.
programming_examples/generate_readme.py	Registers the new GEMM int4-AWQ example for the programming examples dashboard.


- matmul_int4_packed.py: assert kernel-side static_assert constraints (GS % 32 == 0, M_TILE*N_TILE % 32 == 0) at module-build time so unsupported tilings fail with a Python message instead of a C++ template/static_assert error from the kernel build.
- mv_int4_bf16.cc: restructure mm_int4_bf16_impl loop nest to (mi, g, i, n) so each activation load a_vec(mi, g, i) is reused across all n_tile output columns instead of being reloaded per n. Per-(g, n) zero-point broadcasts and per-(mi, n) accumulators are hoisted out of the i loop too. Inner hot path stays load-packed + unpack + sub + cvt + mac.

Acceptance unchanged on NPU2:
- Smoke M=32 K=128 N=64: corr 0.999997
- Llama Q-proj seq=32 M=32 K=2048 N=2048: corr 0.999985
- GEMV regression M=2048 K=2048: corr 0.999997

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the 1D-herd-over-N design with a 2D herd over (M, N) where each PE accumulates K serially into a per-PE row-major bf16 L1 C. Drains via a 4D L2 C [herd_m, herd_n, tile_m, tile_n] with per-PE dst_offsets. Verified PASS at herd 1x1, 2x4, 8x4 (smoke shapes) and at Llama-3.2-1B Q-projection shape M=N=K=2048 herd 8x4 (correlation 0.999986). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the per-(mi, g, i, n) aie::mac inner loop with a two-phase kernel: (1) pack A row-major → mmul-packed [KB][MB][r][s] and dequant W (scale folded in) into mmul-packed [NB][KB][s][t]; (2) aie::mmul<8,8,8, bf16,bf16,accfloat> loop. Same kernel symbol and ABI, drop-in. Measured at M=N=K=2048 herd 8x4: 198 ms → 117 ms (1.7x). bf16 MMUL ceiling at the same shape is 51 ms; remaining gap is dominated by the scatter store in the dequant phase (32 scalar stores per (g, n, i) across a strided [s][t] inner layout). Correlation 0.999955+ at all tested shapes (1x1, 2x4, 8x4, Llama Q-proj). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
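The A-packing step in phase (1) amounts to a tile reordering. Below is a minimal host-side sketch, assuming `[KB][MB][r][s]` denotes a flat row-major buffer with r = s = 8 (the first two dimensions of the 8x8x8 mmul shape); `pack_a` is a hypothetical name, not the kernel routine.

```python
def pack_a(A, r=8, s=8):
    """Row-major A[M][K] -> flat list in [KB][MB][r][s] tile order."""
    M, K = len(A), len(A[0])
    assert M % r == 0 and K % s == 0
    MB, KB = M // r, K // s
    out = [None] * (M * K)
    for kb in range(KB):              # k-block is the outermost dimension
        for mb in range(MB):          # then the m-block
            for ri in range(r):       # row inside the r x s tile
                for si in range(s):   # column inside the tile (contiguous)
                    dst = ((kb * MB + mb) * r + ri) * s + si
                    out[dst] = A[mb * r + ri][kb * s + si]
    return out
```

After packing, each r x s tile occupies 64 consecutive elements, so an mmul operand load reads one contiguous block instead of r strided rows.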

clang-format-17 on mv_int4_bf16.cc and black on matmul_int4_packed.py to fix the format CI check. Also surfaces the int4 GEMM kernel-side static_asserts (tile_m/n/k_l1 % 8 for mmul, gs % 32 for dequant, tile_m*tile_n % 32 for the zero kernel) as Python asserts at module-build time so unsupported tilings fail with a clear message rather than a C++ template/static_assert during compile-kernel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
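The constraints listed in this commit message translate directly into build-time checks. The following is a sketch of what such asserts could look like; the function name and message wording are hypothetical and not copied from matmul_int4_packed.py.

```python
def check_int4_gemm_tiling(tile_m, tile_n, tile_k_l1, gs):
    """Fail early with a readable message instead of a kernel static_assert."""
    for name, value in (("tile_m", tile_m), ("tile_n", tile_n),
                        ("tile_k_l1", tile_k_l1)):
        assert value % 8 == 0, f"{name}={value} must be a multiple of 8 (8x8x8 mmul)"
    assert gs % 32 == 0, f"gs={gs} must be a multiple of 32 (dequant)"
    assert (tile_m * tile_n) % 32 == 0, (
        f"tile_m*tile_n={tile_m * tile_n} must be a multiple of 32 (zero kernel)")
```

For example, `check_int4_gemm_tiling(16, 16, 128, 128)` passes silently, while a group size of 48 raises an `AssertionError` that names the offending parameter.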

Three issues addressed:

1. CI compile failure on amdhx370: the matmul template was instantiating with DIM_K=2048 when the host build only cared about DIM_K for the matvec, overflowing the AIE2P assembly printer's immediate range on the matmul's scratch buffer addressing. Added a separate DIM_K_CHUNK macro (default 128) used by mm_int4_bf16_mmul_impl, decoupled from matvec's DIM_K. matvec callers that only set DIM_K still build cleanly.

2. f32 accumulator across host K-chunk calls. matmul_int4_bf16_packed renamed to matmul_int4_bf16_packed_f32 and now takes float* c. New helpers zero_vectorized_f32_mn and f32_to_bf16_mn handle the L1 C init and the final bf16 narrowing once per launch. Host builder adds an f32 L1 C accumulator + bf16 L1 C drain buffer + convert kernel call between the K loop and the drain.

3. Post-MMUL scale fold. Dequant now produces UNSCALED bf16 W; per-group MMUL accumulates in f32; convert to f32 vec via row-by-row extract (the 64-element store_v gives a bad layout; using the same extract<t>(m_i) pattern as the working pre-MMUL kernel fixes it); scalar multiply by per-n bf16 scale (lifted to f32) and accumulate into the c tile. One bf16 truncate per output element per group instead of per W element, matching the mac kernel's precision pattern.

Correlation at Llama Q-proj seq=32 (M=32 K=N=2048): 0.999945 → 0.999975. Mismatch count (atol=0.05) dropped from ~6800 to ~11 / 65536 (0.017%). max_mismatch_percentage=0.05 in the host script bounds this at 32 elements with margin; correlation > 0.999 remains the primary check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the 64-iter scalar mul+add per (m_b, n_b) per group with a vector mul+add chain: bf16 scale broadcast → aie::mul(c_g_bf16, scale_tile) → f32 accum → row-by-row add into c_acc_buf. The bf16 mul is supported on aie2p (f32 vector mul isn't) and produces an f32 accumulator with no extra truncate. Restores throughput to 117 ms / 147 GOPS at M=N=K=2048 herd 8x4 — matches the prior (precision-broken) MMUL kernel's speed AND keeps the post-MMUL fold + f32 c accumulator precision pattern. Net result vs the rejected mac kernel: 1.7x faster (198 → 117 ms) AND more precise (correlation 0.999974 vs ~0.99995, mismatches stay tiny). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The int4 GEMM micro-kernel previously used 1x1 mmul (single accumulator chain) with a 32-scalar-store dequant scatter, whereas the bf16 baseline uses a 2x2 mmul expansion and a contiguous-store layout. At Q-proj shape (M=N=K=2048, herd 8x4) the kernel ran 117 ms while bf16 ran 61 ms despite having fewer weight bytes to move.

Changes:
- 2x2 mmul: 4 independent accumulators C00/C01/C10/C11, A and B vector reuse 2x per inner kg iter, chess_prepare_for_pipelining hint
- Dequant b_pack layout swapped to [NB][KB][t][s] (n_i outer, k_i inner) so 8 k_i values land contiguously per (n_i, k_b); replaces 32 scalar stores per inner iter with 4 vector stores
- aie::transpose(B, t, s) at mmul load flips back to the mmul-expected [s][t] order in-register, avoiding any host-side per-tile Q repack (keeps pack_inputs consistent with the canonical AWQ tile layout)

Result at Q-proj M=N=K=2048, herd 8x4: 117 ms → 39.5 ms (2.96x). Now 1.55x faster than bf16 baseline (61 ms) at the same shape. Smoke (M=64 K=128 N=128, herd 2x4) still PASS at corr 0.999974.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
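The effect of the layout swap on store contiguity can be seen from the offset arithmetic alone. This sketch assumes both layouts are flat row-major buffers with s = t = 8; the function names and the fixed probe coordinates are illustrative only.

```python
def offset_st(nb, kb, k_i, n_i, KB, s=8, t=8):
    """Flat offset in a [NB][KB][s][t] b_pack (k_i outer, n_i inner)."""
    return ((nb * KB + kb) * s + k_i) * t + n_i

def offset_ts(nb, kb, k_i, n_i, KB, s=8, t=8):
    """Flat offset in a [NB][KB][t][s] b_pack (n_i outer, k_i inner)."""
    return ((nb * KB + kb) * t + n_i) * s + k_i

def k_run_stride(offset_fn, KB=4):
    """Address step between consecutive k_i for a fixed (nb, kb, n_i)."""
    offs = [offset_fn(0, 1, k_i, 3, KB) for k_i in range(8)]
    steps = {b - a for a, b in zip(offs, offs[1:])}
    assert len(steps) == 1  # the step is uniform within a tile
    return steps.pop()

print("[s][t] stride between k_i:", k_run_stride(offset_st))
print("[t][s] stride between k_i:", k_run_stride(offset_ts))
```

In the `[s][t]` layout the 8 k values of one output column sit `t` elements apart, which forces scalar stores; in `[t][s]` they are adjacent and can be written with a single vector store, at the cost of an in-register transpose before the mmul.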

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The 2x2-expansion static_assert in mm_int4_bf16_mmul_impl requires m_tile and n_tile to be multiples of 16. Matvec lit tests build the same .cc with DIM_M=8, which doesn't link any matmul symbol but still instantiates the matmul template via matmul_int4_bf16_packed_f32 and trips the assert. Guard the matmul entry + helpers with #if DIM_M >= 16 && DIM_N >= 16 so matvec builds skip the matmul template instantiation. Confirmed locally: matvec.o (DIM_M=8) builds cleanly with no matmul symbols; matmul.o (DIM_M=16) builds cleanly with full symbol set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 1, 2026 05:46

erwei-xilinx requested a review from jgmelber as a code owner June 1, 2026 05:46

Copilot started reviewing on behalf of erwei-xilinx June 1, 2026 05:46

Copilot AI reviewed Jun 1, 2026


Comment thread programming_examples/matrix_multiplication/int4_awq/matmul_int4_packed.py Outdated

Comment thread programming_examples/matrix_vector_multiplication/int4_awq/mv_int4_bf16.cc Outdated

erwei-xilinx and others added 2 commits May 31, 2026 22:56

erwei-xilinx changed the title ~~[programming_examples] int4-AWQ GEMM kernel + standalone matmul example~~ [programming_examples] int4-AWQ GEMM (packed Q+S+Z) with 2D herd, Llama Q-proj scale Jun 1, 2026

erwei-xilinx changed the title ~~[programming_examples] int4-AWQ GEMM (packed Q+S+Z) with 2D herd, Llama Q-proj scale~~ [programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1 (1.7x over mac) Jun 1, 2026

erwei-xilinx and others added 3 commits June 1, 2026 11:56

erwei-xilinx changed the title ~~[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1 (1.7x over mac)~~ [programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1, f32 c (1.7x over mac) Jun 1, 2026

erwei-xilinx and others added 3 commits June 1, 2026 17:41

fixup: clang-format

32769dd

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

erwei-xilinx wants to merge 10 commits into Xilinx:main from erwei-xilinx:int4-awq-gemm


2 participants
