
[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1, f32 c (1.7x over mac)#1639

Open
erwei-xilinx wants to merge 10 commits into Xilinx:main from erwei-xilinx:int4-awq-gemm

erwei-xilinx (Collaborator) commented Jun 1, 2026

Summary

Adds an int4-AWQ GEMM (matmul_int4_bf16_packed_f32) alongside the existing matvec entries in mv_int4_bf16.cc, plus a programming_examples/matrix_multiplication/int4_awq/ host builder + Makefile + lit tests.

The kernel:

  1. Dequant W (NO scale fold) into mmul-packed bf16 b_pack; pack A into mmul-packed bf16 a_pack.
  2. Per-group MMUL via aie::mmul<8,8,8,bf16,bf16,accfloat> produces an f32 tile per group.
  3. Post-MMUL scale fold: convert MMUL accum to bf16, multiply by bf16 scale broadcast (mul → f32 accum, no extra truncate), accumulate into the persistent f32 c. One bf16 truncate per output element per group instead of per W element — matches the mac-kernel precision pattern.
  4. f32 c kept across host K-chunk calls (bf16 GEMM accumulator pattern). After the K loop a one-shot f32_to_bf16_mn converts the L1 C tile to bf16 for the drain. C never round-trips through bf16 mid-accumulation.
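The numerical idea behind step 3 can be illustrated with a plain-Python reference model. This is a minimal sketch, not the kernel: it assumes the usual AWQ dequant form w = (q - z) * s with one scale and one zero point per (group, column), it uses unpacked row-major lists instead of the packed Q+S+Z layout, and both function names are hypothetical. It shows that scaling the per-group partial sum once gives the same result as scaling every W element when the arithmetic is exact; on hardware the difference lies in how many bf16 truncations occur.

```python
def gemm_scale_per_element(a, q, scale, zero, gs):
    """Reference: scale folded into every dequantized W element."""
    M, K, N = len(a), len(q), len(q[0])
    c = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            for k in range(K):
                g = k // gs  # quantization group that row k belongs to
                w = (q[k][n] - zero[g][n]) * scale[g][n]
                c[m][n] += a[m][k] * w
    return c


def gemm_scale_post_group(a, q, scale, zero, gs):
    """Post-matmul fold: unscaled per-group partial sum, then one scale multiply."""
    M, K, N = len(a), len(q), len(q[0])
    c = [[0.0] * N for _ in range(M)]
    for g in range(K // gs):
        for m in range(M):
            for n in range(N):
                partial = sum(
                    a[m][k] * (q[k][n] - zero[g][n])
                    for k in range(g * gs, (g + 1) * gs)
                )
                c[m][n] += partial * scale[g][n]  # one scale per output per group
    return c


# Tiny example: M=1, K=4, N=2, group size 2. Values are chosen to be exact in binary.
A = [[1.0, 2.0, 3.0, 4.0]]
Q = [[1, 2], [3, 4], [5, 6], [7, 8]]
S = [[0.5, 0.25], [2.0, 1.0]]
Z = [[8, 8], [8, 8]]
pre = gemm_scale_per_element(A, Q, S, Z, gs=2)
post = gemm_scale_post_group(A, Q, S, Z, gs=2)
print(pre, post)
```

Both calls return the same 1x2 result here because every intermediate value is exactly representable; with bf16 inputs the post-group variant rounds once per output element per group rather than once per weight.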

Same packed Q+S+Z BO layout as the GEMV — one mv_int4_bf16.o serves both decode (GEMV) and prefill (GEMM).

Performance (NPU2, M=N=K=2048, herd 8x4)

| Kernel | Per-iter | Throughput | Note |
|---|---|---|---|
| (alternative) int4 via per-element aie::mac | 198 ms | 87 GOPS | rejected; 4x off bf16 peak |
| int4 via aie::mmul + post-MMUL fold + f32 c (this PR) | 117 ms | 147 GOPS | 1.7x faster than mac, 44% of bf16 ceiling |
| bf16 GEMM via aie::mmul (ceiling reference) | 51 ms | 337 GOPS | dense bf16, no quantization |
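The throughput column is consistent with the usual 2*M*N*K operation count for a GEMM divided by the per-iteration time. A small sketch of that arithmetic follows; the helper name is hypothetical, and counting one multiply plus one add per MAC is the assumed convention.

```python
def gemm_gops(m, n, k, seconds):
    """Throughput in GOPS, counting 2 operations (multiply + add) per MAC."""
    return 2.0 * m * n * k / seconds / 1e9


for label, ms in (("mac", 198), ("mmul (this PR)", 117), ("bf16 ceiling", 51)):
    print(f"{label}: {gemm_gops(2048, 2048, 2048, ms / 1000.0):.0f} GOPS")
```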

Remaining gap to the bf16 ceiling: the dequant scatter store (R K-values for fixed N scatter across R/s k-blocks at stride t in the [s][t] inner layout). A vectorized scatter — processing multiple N's in parallel per k-block to enable contiguous stores — could close most of the remaining 2.3x; deferred until a prefill-bound workload demands it.

Host builder design

A 2D herd over (M, N). Each PE owns a (tile_m, tile_n) C-tile and serially accumulates K into a per-PE f32 L1 C accumulator, plus a separate bf16 L1 C drain buffer that the post-K-loop convert kernel writes into. The launch space tiles (M, N) across launches; the K-L2 loop is collapsed to 1 iter (TILE_K_L2 = K) so the f32 accumulator survives across all K_CHUNK steps. Drain DMAs the bf16 buffer into a 4D L2 C [herd_m, herd_n, tile_m, tile_n] using per-PE dst_offsets=[_tx, _ty, 0, 0].
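The tiling arithmetic in this design can be sketched in a few lines. The helper names below are hypothetical and the 16x16 tile is an assumed example value; only the launch-size formula, the 4D L2 C shape, and the dst_offsets pattern come from the description above.

```python
def launch_grid(M, N, tile_m, tile_n, herd_m, herd_n):
    """Launch sizes [M/(tile_m*herd_m), N/(tile_n*herd_n)] with a divisibility check."""
    span_m = tile_m * herd_m  # rows of C covered by one herd invocation
    span_n = tile_n * herd_n  # columns of C covered by one herd invocation
    if M % span_m or N % span_n:
        raise ValueError(f"M={M}, N={N} not divisible by herd span {span_m}x{span_n}")
    return [M // span_m, N // span_n]


def l2_c_shape(tile_m, tile_n, herd_m, herd_n):
    """Shape of the 4D L2 C staging buffer for one launch iteration."""
    return [herd_m, herd_n, tile_m, tile_n]


def pe_dst_offsets(tx, ty):
    """Per-PE drain offsets into the 4D L2 C buffer: [_tx, _ty, 0, 0]."""
    return [tx, ty, 0, 0]


grid = launch_grid(2048, 2048, tile_m=16, tile_n=16, herd_m=8, herd_n=4)
print(grid, l2_c_shape(16, 16, 8, 4), pe_dst_offsets(3, 2))
```

Because each PE writes its whole (tile_m, tile_n) tile at a distinct [_tx, _ty] slot, the two trailing offsets are always zero.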

This replaces an earlier 1D-herd-over-N design that put M on the launch axis and hit a correctness cliff at M_div >= 4: compiler-side air-opt-shim-dma-bds unrolls non-zero-stride launch-axis iterations into per-iteration shim BDs, which is fine at M_div=2, gives silently wrong results at 4-5, hangs the HW at 6-8, and exhausts the BD pool at >=16. The 2D-herd structure sidesteps that path entirely.

Files

| File | What |
|---|---|
| mv_int4_bf16.cc | Adds mm_int4_bf16_mmul_impl<float>, matmul_int4_bf16_packed_f32, zero_vectorized_bf16_mn, zero_vectorized_f32_mn, f32_to_bf16_mn. New DIM_K_CHUNK macro (default 128) decouples the matmul's per-call k_chunk from the matvec's full DIM_K. Existing GEMV symbols and ABI unchanged. |
| matmul_int4_packed.py | Host builder. @launch(sizes=[M/(tile_m*herd_m), N/(tile_n*herd_n)]) + @segment (L2 staging) + @herd(sizes=[herd_m, herd_n]) with per-PE f32 L1 C accumulator + bf16 L1 C drain buffer + convert kernel call between the K loop and the drain. |
| Makefile | Builds the shared mv_int4_bf16.o (passes DIM_K_CHUNK=$(TILE_K_L1)). Targets: run_packed, run1x1, run2x4, run8x4, run_llama_qproj. |
| run_packed_npu2_small_peano.lit | Smoke: M=64 K=128 N=128, herd 2x4. |
| run_packed_npu2_llama_qproj_seq32_peano.lit | Llama-3.2-1B Q-projection at prefill seq=32: M=32 K=2048 N=2048, herd 2x4. |

Notable: zero_vectorized_bf16_mn is not just a size variant

Peano on AIE2P auto-vectorizes a scalar for c[i]=0 loop with a stride-4 store that silently skips every 4th element once the buffer is >= one full vector wide (32 bf16). The existing GEMV's zero_impl<DIM_M=8> stays under that threshold so it was unaffected, but the larger GEMM C tile (e.g. 16x16 = 256 bf16) triggers the bug. Fix: explicit aie::store_v(aie::zeros<bfloat16, 32>()). The existing zero_impl<m> is left untouched. The new zero_vectorized_f32_mn follows the same explicit-vector-store pattern.

Validation (NPU2)

| Shape | Herd | Correlation | Mismatches (atol=0.05) | Result |
|---|---|---|---|---|
| M=16 K=128 N=16 (1x1 smoke) | 1x1 | 0.999967 | 0 / 256 | PASS |
| M=64 K=128 N=128 | 2x4 | 0.999974 | 0 / 8192 | PASS |
| M=256 K=128 N=128 | 8x4 | 0.999973 | 0 / 32768 | PASS |
| Llama Q-proj seq=32 (M=32 K=N=2048) | 2x4 | 0.999974 | 11 / 65536 (0.017%) | PASS |
| Llama Q-proj prefill (M=N=K=2048) | 8x4 | 0.999975 | small | PASS |
| GEMV regression (existing matvec_int4_packed lit) | | 0.999997 (unchanged) | 0 | PASS |

max_mismatch_percentage=0.05 in the host script bounds the small bf16-noise-floor mismatches at large K to ≤0.05% of output elements; min_correlation=0.999 remains the primary correctness check.
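The two-part acceptance criterion can be sketched as follows. This is an assumption-laden illustration, not the host script: the function names are hypothetical, the mismatch test is a plain absolute-tolerance comparison, and Pearson correlation is assumed to be the correlation measure. Only the parameter names and their values (atol=0.05, max_mismatch_percentage=0.05, min_correlation=0.999) come from the text above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def validate(out, ref, atol=0.05, max_mismatch_percentage=0.05, min_correlation=0.999):
    """Return (ok, correlation, mismatch_count, mismatch_percentage)."""
    mismatches = sum(1 for o, r in zip(out, ref) if abs(o - r) > atol)
    pct = 100.0 * mismatches / len(ref)
    corr = pearson(out, ref)
    ok = corr >= min_correlation and pct <= max_mismatch_percentage
    return ok, corr, mismatches, pct


# Synthetic data sized like the seq=32 Q-proj output (32 * 2048 = 65536 elements).
ref = [float(i % 97) for i in range(65536)]
out = list(ref)
for i in range(11):  # perturb 11 elements beyond atol, as in the table above
    out[i] += 0.1
ok, corr, mismatches, pct = validate(out, ref)
print(ok, mismatches, round(pct, 3))
```

With 65536 outputs, a 0.05% budget admits at most 32 mismatching elements, so the 11 observed mismatches pass with margin.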

Test plan

  • lit invocation of run_packed_npu2_small_peano.lit — PASS
  • lit invocation of run_packed_npu2_llama_qproj_seq32_peano.lit — PASS
  • Llama Q-proj at full prefill scale (M=N=K=2048, herd 8x4) — PASS
  • Existing matrix_vector_multiplication/int4_awq/run_packed_* lits still PASS (shared kernel; GEMV symbols/ABI unchanged)
  • Stitch this GEMM into an int4 prefill ELF for llama32_1b (follow-up)

Known limitations

  1. TILE_K_L2 = K required — the f32 L1 C accumulator lives only for one herd invocation and the segment-level K-L2 loop would re-zero it. For K=2048 herd 8x4, L2 A = 8 * 16 * 2048 * 2 = 512 KB — fits one aie2p memtile. To scale K beyond a single memtile's capacity, the K-L2 loop will need to move inside the herd (channels-based feed), or the multi-herd zero+compute+drain pattern needs to work with func.call (currently the air-dependency pass doesn't track CallOp writes through memref.subview, so a herd-wide L1 C with per-PE subviews can't be used today).

  2. Dequant scatter store is the perf floor — closes 1.7x of the 4x gap to bf16 MMUL. A vectorized scatter (parallel-N dequant for contiguous stores per k-block) could push closer to the ceiling; deferred until prefill is the bottleneck.
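The L2 footprint quoted in limitation 1 is simple arithmetic, sketched below. The helper name is hypothetical, and tile_m = 16 is inferred from the 8 * 16 * 2048 * 2 product in the text (herd_m = 8, bf16 = 2 bytes).

```python
def l2_a_bytes(herd_m, tile_m, k, bytes_per_elem=2):
    """Bytes of L2 needed to stage the A rows of one herd invocation for the full K."""
    return herd_m * tile_m * k * bytes_per_elem


for k in (2048, 4096):
    print(f"K={k}: {l2_a_bytes(8, 16, k) // 1024} KB")
```

The footprint grows linearly with K, which is why the TILE_K_L2 = K requirement stops scaling once the staged A no longer fits.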

Known follow-ups

  • Int4 prefill ELFs for llama32_1b stitching this GEMM (Q/K/V/O projections, FFN gate/up/down) into multi-launch ELFs analogous to the existing bf16 prefill ELFs.

Adds matmul_int4_bf16_packed alongside the existing matvec entries in
mv_int4_bf16.cc, plus a programming_examples/matrix_multiplication/int4_awq
host builder, Makefile and lit tests. The GEMM reuses the GEMV's Q+S+Z
per-tile packed BO layout (output-major), extended to m_tile activation
rows; one .o file serves both decode (GEMV) and prefill (GEMM).

Also adds zero_vectorized_bf16_mn — explicit aie::store_v of aie::zeros
for the larger GEMM C tile. Peano auto-vectorizes a scalar `for c[i]=0`
loop on AIE2P with a stride-4 store that skips every 4th element once
the buffer is >= one full vector wide; manifests as repeated kernel
calls reading stale c[]. The existing GEMV's DIM_M=8 stays under the
vectorization threshold so it was unaffected.

Tested on NPU2:
  - Smoke (M=32 K=128 N=64, exercises M_div>1 + N_div>1): corr 0.999997
  - Llama Q-proj at prefill seq=32 (M=32 K=2048 N=2048): corr 0.999985
  - GEMV regression (M=2048 K=2048): unchanged, corr 0.999997

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 1, 2026 05:46
@erwei-xilinx erwei-xilinx requested a review from jgmelber as a code owner June 1, 2026 05:46
Copilot AI (Contributor) left a comment

Pull request overview

Adds an int4-AWQ packed GEMM (matmul) path alongside the existing packed GEMV micro-kernels, plus a standalone matrix-multiplication programming example (builder + Makefile + NPU2 lit coverage) that reuses the same compiled mv_int4_bf16.o kernel object.

Changes:

  • Extend mv_int4_bf16.cc with matmul_int4_bf16_packed, a GEMM inner kernel (mm_int4_bf16_impl), and a GEMM-safe vectorized zero helper (zero_vectorized_bf16_mn).
  • Add a new programming_examples/matrix_multiplication/int4_awq/ example with a host builder (matmul_int4_packed.py) and Makefile that compiles the shared kernel from the matvec directory.
  • Add two NPU2+Peano lit smoke/correctness tests for the new GEMM example and register it in the programming examples dashboard generator.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| programming_examples/matrix_vector_multiplication/int4_awq/mv_int4_bf16.cc | Adds packed int4-AWQ GEMM kernel entry point and GEMM-oriented zeroing helper alongside existing GEMV micro-kernels. |
| programming_examples/matrix_multiplication/int4_awq/matmul_int4_packed.py | Introduces a standalone AIR/XRT builder for packed int4-AWQ GEMM using the shared kernel object. |
| programming_examples/matrix_multiplication/int4_awq/Makefile | Builds/runs the GEMM example while reusing mv_int4_bf16.cc from the matvec example directory. |
| programming_examples/matrix_multiplication/int4_awq/run_packed_npu2_small_peano.lit | Adds a small NPU2 smoke test for the packed GEMM flow. |
| programming_examples/matrix_multiplication/int4_awq/run_packed_npu2_llama_qproj_seq32_peano.lit | Adds an NPU2 correctness test based on Llama Q-proj prefill sizing. |
| programming_examples/generate_readme.py | Registers the new GEMM int4-AWQ example for the programming examples dashboard. |


Comment thread programming_examples/matrix_multiplication/int4_awq/matmul_int4_packed.py Outdated
Comment thread programming_examples/matrix_vector_multiplication/int4_awq/mv_int4_bf16.cc Outdated
erwei-xilinx and others added 2 commits May 31, 2026 22:56
- matmul_int4_packed.py: assert kernel-side static_assert constraints
  (GS % 32 == 0, M_TILE*N_TILE % 32 == 0) at module-build time so
  unsupported tilings fail with a Python message instead of a C++
  template/static_assert error from the kernel build.

- mv_int4_bf16.cc: restructure mm_int4_bf16_impl loop nest to (mi, g, i, n)
  so each activation load a_vec(mi, g, i) is reused across all n_tile
  output columns instead of being reloaded per n. Per-(g, n) zero-point
  broadcasts and per-(mi, n) accumulators are hoisted out of the i loop
  too. Inner hot path stays load-packed + unpack + sub + cvt + mac.

Acceptance unchanged on NPU2:
  - Smoke M=32 K=128 N=64: corr 0.999997
  - Llama Q-proj seq=32 M=32 K=2048 N=2048: corr 0.999985
  - GEMV regression M=2048 K=2048: corr 0.999997

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the 1D-herd-over-N design with a 2D herd over (M, N) where each
PE accumulates K serially into a per-PE row-major bf16 L1 C. Drains via
a 4D L2 C [herd_m, herd_n, tile_m, tile_n] with per-PE dst_offsets.

Verified PASS at herd 1x1, 2x4, 8x4 (smoke shapes) and at Llama-3.2-1B
Q-projection shape M=N=K=2048 herd 8x4 (correlation 0.999986).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx changed the title from "[programming_examples] int4-AWQ GEMM kernel + standalone matmul example" to "[programming_examples] int4-AWQ GEMM (packed Q+S+Z) with 2D herd, Llama Q-proj scale" Jun 1, 2026
Replaces the per-(mi, g, i, n) aie::mac inner loop with a two-phase
kernel: (1) pack A row-major → mmul-packed [KB][MB][r][s] and dequant W
(scale folded in) into mmul-packed [NB][KB][s][t]; (2) aie::mmul<8,8,8,
bf16,bf16,accfloat> loop. Same kernel symbol and ABI, drop-in.

Measured at M=N=K=2048 herd 8x4: 198 ms → 117 ms (1.7x). bf16 MMUL
ceiling at the same shape is 51 ms; remaining gap is dominated by the
scatter store in the dequant phase (32 scalar stores per (g, n, i)
across a strided [s][t] inner layout).

Correlation 0.999955+ at all tested shapes (1x1, 2x4, 8x4, Llama Q-proj).
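The A-side packing named in this commit can be modeled in plain Python. This is a sketch under assumptions: r = s = 8 is taken from the aie::mmul<8,8,8,...> shape, the index mapping a_pack[kb][mb][ri][si] = A[mb*r + ri][kb*s + si] is one reading of the [KB][MB][r][s] layout name, and the function name is hypothetical.

```python
def pack_a(a, r=8, s=8):
    """Repack row-major A (M x K) into nested blocks indexed [KB][MB][r][s]."""
    M, K = len(a), len(a[0])
    if M % r or K % s:
        raise ValueError("M must be a multiple of r and K a multiple of s")
    return [
        [
            [[a[mb * r + ri][kb * s + si] for si in range(s)] for ri in range(r)]
            for mb in range(M // r)
        ]
        for kb in range(K // s)
    ]


# 16x16 matrix whose entries encode their own row-major index.
A = [[row * 16 + col for col in range(16)] for row in range(16)]
packed = pack_a(A)
print(packed[1][0][2][3])  # element from A row 2, column 11
```

Each innermost r x s block is contiguous after packing, so one block can be loaded as a single operand tile.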

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx changed the title from "[programming_examples] int4-AWQ GEMM (packed Q+S+Z) with 2D herd, Llama Q-proj scale" to "[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1 (1.7x over mac)" Jun 1, 2026
erwei-xilinx and others added 3 commits June 1, 2026 11:56
clang-format-17 on mv_int4_bf16.cc and black on matmul_int4_packed.py
to fix the format CI check. Also surfaces the int4 GEMM kernel-side
static_asserts (tile_m/n/k_l1 % 8 for mmul, gs % 32 for dequant,
tile_m*tile_n % 32 for the zero kernel) as Python asserts at module-
build time so unsupported tilings fail with a clear message rather
than a C++ template/static_assert during compile-kernel.
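A build-time check of this kind might look like the sketch below. The function name and messages are hypothetical; the three divisibility rules are the ones listed in this commit message. The sketch raises ValueError instead of using assert, which is a choice made here so the check still runs under python -O.

```python
def check_int4_gemm_tiling(tile_m, tile_n, tile_k_l1, gs):
    """Mirror the kernel-side static_asserts so bad tilings fail early in Python."""
    for name, value in (("tile_m", tile_m), ("tile_n", tile_n), ("tile_k_l1", tile_k_l1)):
        if value % 8 != 0:
            raise ValueError(f"{name}={value} must be a multiple of 8 (mmul shape)")
    if gs % 32 != 0:
        raise ValueError(f"gs={gs} must be a multiple of 32 (dequant)")
    if (tile_m * tile_n) % 32 != 0:
        raise ValueError(f"tile_m*tile_n={tile_m * tile_n} must be a multiple of 32 (zero kernel)")


check_int4_gemm_tiling(tile_m=16, tile_n=16, tile_k_l1=128, gs=128)  # passes silently
try:
    check_int4_gemm_tiling(tile_m=16, tile_n=16, tile_k_l1=128, gs=48)
except ValueError as err:
    message = str(err)
    print(message)
```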

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues addressed:

1. CI compile failure on amdhx370: the matmul template was instantiating
   with DIM_K=2048 when the host build only cared about DIM_K for the
   matvec — overflowing the AIE2P assembly printer's immediate range on
   the matmul's scratch buffer addressing. Added a separate DIM_K_CHUNK
   macro (default 128) used by mm_int4_bf16_mmul_impl, decoupled from
   matvec's DIM_K. matvec callers that only set DIM_K still build cleanly.

2. f32 accumulator across host K-chunk calls. matmul_int4_bf16_packed
   renamed to matmul_int4_bf16_packed_f32 and now takes float* c. New
   helpers zero_vectorized_f32_mn and f32_to_bf16_mn handle the L1 C
   init and the final bf16 narrowing once per launch. Host builder adds
   an f32 L1 C accumulator + bf16 L1 C drain buffer + convert kernel
   call between the K loop and the drain.

3. Post-MMUL scale fold. Dequant now produces UNSCALED bf16 W; per-group
   MMUL accumulates in f32; convert to f32 vec via row-by-row extract
   (the 64-element store_v gives a bad layout — using the same
   extract<t>(m_i) pattern as the working pre-MMUL kernel fixes it);
   scalar multiply by per-n bf16 scale (lifted to f32) and accumulate
   into the c tile. One bf16 truncate per output element per group
   instead of per W element — matches mac kernel's precision pattern.

Correlation at Llama Q-proj seq=32 (M=32 K=N=2048): 0.999945 → 0.999975.
Mismatch count (atol=0.05) dropped from ~6800 to ~11 / 65536 (0.017%).
max_mismatch_percentage=0.05 in the host script bounds this at 32
elements with margin — correlation > 0.999 remains the primary check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the 64-iter scalar mul+add per (m_b, n_b) per group with a
vector mul+add chain: bf16 scale broadcast → aie::mul(c_g_bf16,
scale_tile) → f32 accum → row-by-row add into c_acc_buf. The bf16
mul is supported on aie2p (f32 vector mul isn't) and produces an
f32 accumulator with no extra truncate.

Restores throughput to 117 ms / 147 GOPS at M=N=K=2048 herd 8x4 —
matches the prior (precision-broken) MMUL kernel's speed AND keeps
the post-MMUL fold + f32 c accumulator precision pattern. Net result
vs the rejected mac kernel: 1.7x faster (198 → 117 ms) AND more
precise (correlation 0.999974 vs ~0.99995, mismatches stay tiny).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
erwei-xilinx changed the title from "[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1 (1.7x over mac)" to "[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1, f32 c (1.7x over mac)" Jun 1, 2026
erwei-xilinx and others added 3 commits June 1, 2026 17:41
The int4 GEMM micro-kernel previously used 1x1 mmul (single accumulator
chain) with a 32-scalar-store dequant scatter, whereas the bf16 baseline
uses a 2x2 mmul expansion and a contiguous-store layout. At
Q-proj shape (M=N=K=2048, herd 8x4) the kernel ran 117 ms while bf16
ran 61 ms despite having fewer weight bytes to move.

Changes:
- 2x2 mmul: 4 independent accumulators C00/C01/C10/C11, A and B vector
  reuse 2x per inner kg iter, chess_prepare_for_pipelining hint
- Dequant b_pack layout swapped to [NB][KB][t][s] (n_i outer, k_i
  inner) so 8 k_i values land contiguously per (n_i, k_b) — replaces
  32 scalar stores per inner iter with 4 vector stores
- aie::transpose(B, t, s) at mmul load flips back to the mmul-expected
  [s][t] order in-register, avoiding any host-side per-tile Q repack
  (keeps pack_inputs consistent with the canonical AWQ tile layout)
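The operand reuse behind the 2x2 expansion can be checked by counting tile loads per mmul. This is a counting model only, with a hypothetical function name; it assumes each A and B tile is loaded once per inner k iteration and shared across the accumulators of the current block, and it uses a 16x16 C tile with a 128-deep K chunk (2 x 2 x 16 tiles of 8x8x8) as an example.

```python
def count_tile_loads(mb, nb, kb, expand):
    """Count operand tile loads and mmul calls for an expand x expand blocking."""
    loads = 0
    mmuls = 0
    for _m0 in range(0, mb, expand):
        for _n0 in range(0, nb, expand):
            for _k in range(kb):
                loads += expand  # A tiles for the rows of this block
                loads += expand  # B tiles for the columns of this block
                mmuls += expand * expand  # one mac per accumulator in the block
    return loads, mmuls


loads_1x1, mmuls_1x1 = count_tile_loads(mb=2, nb=2, kb=16, expand=1)
loads_2x2, mmuls_2x2 = count_tile_loads(mb=2, nb=2, kb=16, expand=2)
print(loads_1x1 / mmuls_1x1, loads_2x2 / mmuls_2x2)
```

The mmul count is identical in both cases while the load count halves, and the four accumulators are independent of one another, which is the property the pipelining hint relies on.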

Result at Q-proj M=N=K=2048, herd 8x4: 117 ms → 39.5 ms (2.96x). Now
1.55x faster than bf16 baseline (61 ms) at the same shape. Smoke
(M=64 K=128 N=128, herd 2x4) still PASS at corr 0.999974.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2x2-expansion static_assert in mm_int4_bf16_mmul_impl requires
m_tile and n_tile to be multiples of 16. Matvec lit tests build the
same .cc with DIM_M=8, which doesn't link any matmul symbol but still
instantiates the matmul template via matmul_int4_bf16_packed_f32 and
trips the assert.

Guard the matmul entry + helpers with #if DIM_M >= 16 && DIM_N >= 16
so matvec builds skip the matmul template instantiation. Confirmed
locally: matvec.o (DIM_M=8) builds cleanly with no matmul symbols;
matmul.o (DIM_M=16) builds cleanly with full symbol set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>