[programming_examples] int4-AWQ GEMM via aie::mmul + dequant-to-L1, f32 c (1.7x over mac)#1639
Open
erwei-xilinx wants to merge 10 commits into
Adds matmul_int4_bf16_packed alongside the existing matvec entries in mv_int4_bf16.cc, plus a programming_examples/matrix_multiplication/int4_awq host builder, Makefile and lit tests. The GEMM reuses the GEMV's Q+S+Z per-tile packed BO layout (output-major), extended to m_tile activation rows; one .o file serves both decode (GEMV) and prefill (GEMM).

Also adds zero_vectorized_bf16_mn: explicit aie::store_v of aie::zeros for the larger GEMM C tile. Peano auto-vectorizes a scalar `for c[i]=0` loop on AIE2P with a stride-4 store that skips every 4th element once the buffer is >= one full vector wide; manifests as repeated kernel calls reading stale c[]. The existing GEMV's DIM_M=8 stays under the vectorization threshold so it was unaffected.

Tested on NPU2:
- Smoke (M=32 K=128 N=64, exercises M_div>1 + N_div>1): corr 0.999997
- Llama Q-proj at prefill seq=32 (M=32 K=2048 N=2048): corr 0.999985
- GEMV regression (M=2048 K=2048): unchanged, corr 0.999997

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds an int4-AWQ packed GEMM (matmul) path alongside the existing packed GEMV micro-kernels, plus a standalone matrix-multiplication programming example (builder + Makefile + NPU2 lit coverage) that reuses the same compiled mv_int4_bf16.o kernel object.
Changes:
- Extend mv_int4_bf16.cc with matmul_int4_bf16_packed, a GEMM inner kernel (mm_int4_bf16_impl), and a GEMM-safe vectorized zero helper (zero_vectorized_bf16_mn).
- Add a new programming_examples/matrix_multiplication/int4_awq/ example with a host builder (matmul_int4_packed.py) and Makefile that compiles the shared kernel from the matvec directory.
- Add two NPU2+Peano lit smoke/correctness tests for the new GEMM example and register it in the programming examples dashboard generator.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| programming_examples/matrix_vector_multiplication/int4_awq/mv_int4_bf16.cc | Adds packed int4-AWQ GEMM kernel entry point and GEMM-oriented zeroing helper alongside existing GEMV micro-kernels. |
| programming_examples/matrix_multiplication/int4_awq/matmul_int4_packed.py | Introduces a standalone AIR/XRT builder for packed int4-AWQ GEMM using the shared kernel object. |
| programming_examples/matrix_multiplication/int4_awq/Makefile | Builds/runs the GEMM example while reusing mv_int4_bf16.cc from the matvec example directory. |
| programming_examples/matrix_multiplication/int4_awq/run_packed_npu2_small_peano.lit | Adds a small NPU2 smoke test for the packed GEMM flow. |
| programming_examples/matrix_multiplication/int4_awq/run_packed_npu2_llama_qproj_seq32_peano.lit | Adds an NPU2 correctness test based on Llama Q-proj prefill sizing. |
| programming_examples/generate_readme.py | Registers the new GEMM int4-AWQ example for the programming examples dashboard. |
- matmul_int4_packed.py: assert kernel-side static_assert constraints (GS % 32 == 0, M_TILE*N_TILE % 32 == 0) at module-build time so unsupported tilings fail with a Python message instead of a C++ template/static_assert error from the kernel build.
- mv_int4_bf16.cc: restructure mm_int4_bf16_impl loop nest to (mi, g, i, n) so each activation load a_vec(mi, g, i) is reused across all n_tile output columns instead of being reloaded per n. Per-(g, n) zero-point broadcasts and per-(mi, n) accumulators are hoisted out of the i loop too. Inner hot path stays load-packed + unpack + sub + cvt + mac.

Acceptance unchanged on NPU2:
- Smoke M=32 K=128 N=64: corr 0.999997
- Llama Q-proj seq=32 M=32 K=2048 N=2048: corr 0.999985
- GEMV regression M=2048 K=2048: corr 0.999997

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
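A minimal sketch of what such a build-time check could look like. The function name `check_tiling` and its parameter names are hypothetical and not taken from matmul_int4_packed.py; only the two conditions (GS % 32 == 0 and M_TILE*N_TILE % 32 == 0) come from the commit message. The sketch raises ValueError instead of using a bare `assert`, which is a deliberate substitution so the check still runs under `python -O`.

```python
def check_tiling(gs: int, m_tile: int, n_tile: int) -> None:
    """Reject tilings that the kernel-side static_asserts would refuse.

    Hypothetical helper: the names are illustrative; the two conditions
    mirror the constraints listed in the commit message.
    """
    if gs % 32 != 0:
        raise ValueError(f"group size {gs} must be a multiple of 32")
    if (m_tile * n_tile) % 32 != 0:
        raise ValueError(
            f"m_tile * n_tile = {m_tile * n_tile} must be a multiple of 32"
        )


# A supported tiling passes silently.
check_tiling(gs=128, m_tile=16, n_tile=16)
```

Calling it with, for example, `gs=48` fails immediately with a readable Python error rather than during the kernel compile.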
Replaces the 1D-herd-over-N design with a 2D herd over (M, N) where each PE accumulates K serially into a per-PE row-major bf16 L1 C. Drains via a 4D L2 C [herd_m, herd_n, tile_m, tile_n] with per-PE dst_offsets.

Verified PASS at herd 1x1, 2x4, 8x4 (smoke shapes) and at Llama-3.2-1B Q-projection shape M=N=K=2048 herd 8x4 (correlation 0.999986).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
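The 4D drain layout can be illustrated with a small index calculation. This is a sketch under two assumptions not stated in the commit: the L2 buffer is linearized row-major, and the first herd coordinate indexes the herd_m axis. The function names are illustrative only.

```python
def l2_flat_offset(tx, ty, i, j, herd_n, tile_m, tile_n):
    """Flat element offset of tile element (i, j) written by PE (tx, ty)
    into a row-major [herd_m, herd_n, tile_m, tile_n] buffer (assumed layout)."""
    return ((tx * herd_n + ty) * tile_m + i) * tile_n + j


def global_coords(tx, ty, i, j, tile_m, tile_n):
    """Row and column of the same element inside the herd's (M, N) block."""
    return tx * tile_m + i, ty * tile_n + j


# PE (1, 2) in a 2x4 herd with 16x16 tiles starts writing at element 1536.
start = l2_flat_offset(1, 2, 0, 0, herd_n=4, tile_m=16, tile_n=16)
print(start)
```

Because every PE writes a contiguous tile_m * tile_n region at its own offset, no two PEs touch the same L2 elements.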
Replaces the per-(mi, g, i, n) aie::mac inner loop with a two-phase kernel:
(1) pack A row-major → mmul-packed [KB][MB][r][s] and dequant W (scale folded in) into mmul-packed [NB][KB][s][t];
(2) aie::mmul<8,8,8, bf16,bf16,accfloat> loop.
Same kernel symbol and ABI, drop-in.

Measured at M=N=K=2048 herd 8x4: 198 ms → 117 ms (1.7x). bf16 MMUL ceiling at the same shape is 51 ms; remaining gap is dominated by the scatter store in the dequant phase (32 scalar stores per (g, n, i) across a strided [s][t] inner layout). Correlation 0.999955+ at all tested shapes (1x1, 2x4, 8x4, Llama Q-proj).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
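The A-side packing in phase (1) amounts to an index permutation. The following is a scalar sketch, assuming r = s = 8 to match aie::mmul<8,8,8> and a flat row-major input; `pack_a` is a hypothetical name, and the real kernel does this with vector operations rather than element by element.

```python
def pack_a(a, m, k, r=8, s=8):
    """Reorder a flat row-major [m][k] matrix into [KB][MB][r][s] blocks.

    Illustrative scalar version; block sizes r and s are assumed to be 8.
    """
    assert m % r == 0 and k % s == 0
    mb_count = m // r
    packed = [0] * (m * k)
    for row in range(m):
        mb, ri = divmod(row, r)      # block row, row inside the block
        for col in range(k):
            kb, si = divmod(col, s)  # block column, column inside the block
            dst = ((kb * mb_count + mb) * r + ri) * s + si
            packed[dst] = a[row * k + col]
    return packed


m, k = 16, 16
a = list(range(m * k))               # a[row * k + col] = row * k + col
packed = pack_a(a, m, k)
print(packed[:8], packed[8:16])      # first two rows of the first 8x8 block
```

After packing, each 8x8 block is contiguous in memory, so a block can be loaded as a single 64-element unit.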
clang-format-17 on mv_int4_bf16.cc and black on matmul_int4_packed.py to fix the format CI check.

Also surfaces the int4 GEMM kernel-side static_asserts (tile_m/n/k_l1 % 8 for mmul, gs % 32 for dequant, tile_m*tile_n % 32 for the zero kernel) as Python asserts at module-build time so unsupported tilings fail with a clear message rather than a C++ template/static_assert during compile-kernel.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three issues addressed:

1. CI compile failure on amdhx370: the matmul template was instantiating with DIM_K=2048 when the host build only cared about DIM_K for the matvec, overflowing the AIE2P assembly printer's immediate range on the matmul's scratch buffer addressing. Added a separate DIM_K_CHUNK macro (default 128) used by mm_int4_bf16_mmul_impl, decoupled from matvec's DIM_K. matvec callers that only set DIM_K still build cleanly.
2. f32 accumulator across host K-chunk calls. matmul_int4_bf16_packed renamed to matmul_int4_bf16_packed_f32 and now takes float* c. New helpers zero_vectorized_f32_mn and f32_to_bf16_mn handle the L1 C init and the final bf16 narrowing once per launch. Host builder adds an f32 L1 C accumulator + bf16 L1 C drain buffer + convert kernel call between the K loop and the drain.
3. Post-MMUL scale fold. Dequant now produces UNSCALED bf16 W; per-group MMUL accumulates in f32; convert to f32 vec via row-by-row extract (the 64-element store_v gives a bad layout; using the same extract<t>(m_i) pattern as the working pre-MMUL kernel fixes it); scalar multiply by per-n bf16 scale (lifted to f32) and accumulate into the c tile. One bf16 truncate per output element per group instead of per W element, matching the mac kernel's precision pattern.

Correlation at Llama Q-proj seq=32 (M=32 K=N=2048): 0.999945 → 0.999975. Mismatch count (atol=0.05) dropped from ~6800 to ~11 / 65536 (0.017%). max_mismatch_percentage=0.05 in the host script bounds this at 32 elements with margin; correlation > 0.999 remains the primary check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
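The post-MMUL scale fold in item 3 rests on a simple identity: within one quantization group the scale is constant, so it can be factored out of the dot product. The sketch below checks that identity in exact rational arithmetic. Function names and sample values are illustrative assumptions, and this is not the kernel code; on hardware the two orderings differ only in where bf16 rounding happens.

```python
from fractions import Fraction


def fold_pre(a, q, z, s):
    """Scale applied to every dequantized weight before the dot product."""
    return sum(ak * ((qk - z) * s) for ak, qk in zip(a, q))


def fold_post(a, q, z, s):
    """Unscaled dot product first, then one multiply by the group scale."""
    return s * sum(ak * (qk - z) for ak, qk in zip(a, q))


# One group of four weights with made-up values; Fractions keep it exact.
a = [Fraction(1, 2), Fraction(3, 4), Fraction(-5, 8), Fraction(2)]
q = [3, 15, 0, 9]          # int4 codes in [0, 15]
z = 8                      # zero point for this group
s = Fraction(3, 128)       # scale for this group

print(fold_pre(a, q, z, s) == fold_post(a, q, z, s))
```

With the scale applied after the group's accumulation, the rounding step is paid once per output element per group rather than once per weight, which is the precision pattern the commit describes.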
Replace the 64-iter scalar mul+add per (m_b, n_b) per group with a vector mul+add chain: bf16 scale broadcast → aie::mul(c_g_bf16, scale_tile) → f32 accum → row-by-row add into c_acc_buf. The bf16 mul is supported on aie2p (f32 vector mul isn't) and produces an f32 accumulator with no extra truncate.

Restores throughput to 117 ms / 147 GOPS at M=N=K=2048 herd 8x4, which matches the prior (precision-broken) MMUL kernel's speed AND keeps the post-MMUL fold + f32 c accumulator precision pattern.

Net result vs the rejected mac kernel: 1.7x faster (198 → 117 ms) AND more precise (correlation 0.999974 vs ~0.99995, mismatches stay tiny).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
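The throughput and speedup figures can be reproduced with a few lines of arithmetic. Counting two operations per multiply-accumulate is an assumption here, chosen because it matches the quoted 147 GOPS; the helper name `gops` is illustrative.

```python
def gops(m, n, k, seconds):
    """Throughput in giga-operations per second, assuming 2 ops per MAC."""
    return 2 * m * n * k / seconds / 1e9


throughput = gops(2048, 2048, 2048, 0.117)   # 117 ms at M=N=K=2048
speedup = 198 / 117                          # mac kernel time / mmul kernel time

print(round(throughput), round(speedup, 1))
```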
The int4 GEMM micro-kernel previously used 1x1 mmul (single accumulator chain) with a 32-scalar-store dequant scatter, whereas the bf16 baseline uses the 2x2 mmul expansion and a contiguous-store layout. At Q-proj shape (M=N=K=2048, herd 8x4) the kernel ran 117 ms while bf16 ran 61 ms despite having fewer weight bytes to move.

Changes:
- 2x2 mmul: 4 independent accumulators C00/C01/C10/C11, A and B vector reuse 2x per inner kg iter, chess_prepare_for_pipelining hint
- Dequant b_pack layout swapped to [NB][KB][t][s] (n_i outer, k_i inner) so 8 k_i values land contiguously per (n_i, k_b); replaces 32 scalar stores per inner iter with 4 vector stores
- aie::transpose(B, t, s) at mmul load flips back to the mmul-expected [s][t] order in-register, avoiding any host-side per-tile Q repack (keeps pack_inputs consistent with the canonical AWQ tile layout)

Result at Q-proj M=N=K=2048, herd 8x4: 117 ms → 39.5 ms (2.96x). Now 1.55x faster than bf16 baseline (61 ms) at the same shape. Smoke (M=64 K=128 N=128, herd 2x4) still PASS at corr 0.999974.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 2x2-expansion static_assert in mm_int4_bf16_mmul_impl requires m_tile and n_tile to be multiples of 16. Matvec lit tests build the same .cc with DIM_M=8, which doesn't link any matmul symbol but still instantiates the matmul template via matmul_int4_bf16_packed_f32 and trips the assert.

Guard the matmul entry + helpers with #if DIM_M >= 16 && DIM_N >= 16 so matvec builds skip the matmul template instantiation.

Confirmed locally: matvec.o (DIM_M=8) builds cleanly with no matmul symbols; matmul.o (DIM_M=16) builds cleanly with full symbol set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds an int4-AWQ GEMM (matmul_int4_bf16_packed_f32) alongside the existing matvec entries in mv_int4_bf16.cc, plus a programming_examples/matrix_multiplication/int4_awq/ host builder + Makefile + lit tests.

The kernel:
- aie::mmul<8,8,8,bf16,bf16,accfloat> produces an f32 tile per group.
- … mul → f32 accum, no extra truncate), accumulate into the persistent f32 c. One bf16 truncate per output element per group instead of per W element, which matches the mac-kernel precision pattern.
- f32_to_bf16_mn converts the L1 C tile to bf16 for the drain. C never round-trips through bf16 mid-accumulation.

Same packed Q+S+Z BO layout as the GEMV: one mv_int4_bf16.o serves both decode (GEMV) and prefill (GEMM).

Performance (NPU2, M=N=K=2048, herd 8x4)
- aie::mac
- aie::mmul + post-MMUL fold + f32 c (this PR)
- aie::mmul (ceiling reference)

Remaining gap to the bf16 ceiling: the dequant scatter store (R K-values for fixed N scatter across R/s k-blocks at stride t in the [s][t] inner layout). A vectorized scatter, processing multiple N's in parallel per k-block to enable contiguous stores, could close most of the remaining 2.3x; deferred until a prefill-bound workload demands it.
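For orientation, the arithmetic the kernel computes can be written as a scalar reference loop. This is a sketch under assumptions that are not confirmed by the PR text: weights are shown already unpacked to one int4 code per element, stored as q[n][k], with one scale and one zero point per (output column, K-group). The name `awq_gemm_ref` is hypothetical, and the packed Q+S+Z buffer layout is not modeled.

```python
def awq_gemm_ref(a, q, scales, zeros, m, k, n, gs):
    """Scalar reference: c[i][j] = sum over groups g of
    scales[j][g] * sum_{kk in g} a[i][kk] * (q[j][kk] - zeros[j][g]).

    Assumed layouts: a[m][k], q[n][k], scales[n][k // gs], zeros[n][k // gs].
    """
    assert k % gs == 0
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0                          # stands in for the f32 accumulator
            for g in range(k // gs):
                partial = 0.0                  # unscaled per-group dot product
                for kk in range(g * gs, (g + 1) * gs):
                    partial += a[i][kk] * (q[j][kk] - zeros[j][g])
                acc += scales[j][g] * partial  # scale applied once per group
            c[i][j] = acc
    return c


# Tiny example: 1x4 activations, one output column, two groups of size 2.
c = awq_gemm_ref(
    a=[[1.0, 2.0, 3.0, 4.0]],
    q=[[8, 9, 7, 6]],
    scales=[[0.5, 2.0]],
    zeros=[[8, 7]],
    m=1, k=4, n=1, gs=2,
)
print(c)
```

A loop like this is only useful as a host-side correctness oracle for small shapes; it says nothing about how the kernel tiles or vectorizes the work.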
Host builder design
A 2D herd over (M, N). Each PE owns a (tile_m, tile_n) C-tile and serially accumulates K into a per-PE f32 L1 C accumulator, plus a separate bf16 L1 C drain buffer that the post-K-loop convert kernel writes into. The launch space tiles (M, N) across launches; the K-L2 loop is collapsed to 1 iter (TILE_K_L2 = K) so the f32 accumulator survives across all K_CHUNK steps. Drain DMAs the bf16 buffer into a 4D L2 C [herd_m, herd_n, tile_m, tile_n] using per-PE dst_offsets=[_tx, _ty, 0, 0].

This replaces an earlier 1D-herd-over-N design that put M on the launch axis and hit a correctness cliff at M_div >= 4 (compiler-side air-opt-shim-dma-bds unrolls non-zero-stride launch-axis iterations into per-iter shim BDs: fine at M_div=2, silently wrong results at 4-5, HW hang at 6-8, BD-pool exhaustion at >=16). The 2D-herd structure sidesteps that path entirely.

Files
| File | Description |
|---|---|
| mv_int4_bf16.cc | mm_int4_bf16_mmul_impl<float>, matmul_int4_bf16_packed_f32, zero_vectorized_bf16_mn, zero_vectorized_f32_mn, f32_to_bf16_mn. New DIM_K_CHUNK macro (default 128) decouples the matmul's per-call k_chunk from the matvec's full DIM_K. Existing GEMV symbols and ABI unchanged. |
| matmul_int4_packed.py | @launch(sizes=[M/(tile_m*herd_m), N/(tile_n*herd_n)]) + @segment (L2 staging) + @herd(sizes=[herd_m, herd_n]) with per-PE f32 L1 C accumulator + bf16 L1 C drain buffer + convert kernel call between the K loop and the drain. |
| Makefile | mv_int4_bf16.o (passes DIM_K_CHUNK=$(TILE_K_L1)). Targets: run_packed, run1x1, run2x4, run8x4, run_llama_qproj. |
| run_packed_npu2_small_peano.lit | |
| run_packed_npu2_llama_qproj_seq32_peano.lit | |

Notable: zero_vectorized_bf16_mn is not just a size variant

Peano on AIE2P auto-vectorizes a scalar `for c[i]=0` loop with a stride-4 store that silently skips every 4th element once the buffer is >= one full vector wide (32 bf16). The existing GEMV's zero_impl<DIM_M=8> stays under that threshold so it was unaffected, but the larger GEMM C tile (e.g. 16x16 = 256 bf16) triggers the bug. Fix: explicit aie::store_v(aie::zeros<bfloat16, 32>()). The existing zero_impl<m> is left untouched. The new zero_vectorized_f32_mn follows the same explicit-vector-store pattern.

Validation (NPU2)
… matvec_int4_packed lit)

max_mismatch_percentage=0.05 in the host script bounds the small bf16-noise-floor mismatches at large K to ≤0.05% of output elements; min_correlation=0.999 remains the primary correctness check.

Test plan
- lit invocation of run_packed_npu2_small_peano.lit: PASS
- lit invocation of run_packed_npu2_llama_qproj_seq32_peano.lit: PASS
- matrix_vector_multiplication/int4_awq/run_packed_* lits still PASS (shared kernel; GEMV symbols/ABI unchanged)

Known limitations
- TILE_K_L2 = K required: the f32 L1 C accumulator lives only for one herd invocation and the segment-level K-L2 loop would re-zero it. For K=2048 herd 8x4, L2 A = 8 * 16 * 2048 * 2 = 512 KB, which fits one aie2p memtile. To scale K beyond a single memtile's capacity, the K-L2 loop will need to move inside the herd (channels-based feed), or the multi-herd zero+compute+drain pattern needs to work with func.call (currently the air-dependency pass doesn't track CallOp writes through memref.subview, so a herd-wide L1 C with per-PE subviews can't be used today).
- Dequant scatter store is the perf floor: closes 1.7x of the 4x gap to bf16 MMUL. A vectorized scatter (parallel-N dequant for contiguous stores per k-block) could push closer to the ceiling; deferred until prefill is the bottleneck.
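The L2 A footprint quoted in the first limitation follows from the tile shape. A small sketch of that arithmetic, assuming bf16 activations at 2 bytes per element and tile_m = 16 as implied by the 8 * 16 * 2048 * 2 product; the helper name is illustrative.

```python
def l2_a_bytes(herd_m, tile_m, k, bytes_per_elem=2):
    """Bytes of A staged in L2 when the whole K extent is resident
    (TILE_K_L2 = K): herd_m * tile_m rows of k elements each."""
    return herd_m * tile_m * k * bytes_per_elem


size = l2_a_bytes(herd_m=8, tile_m=16, k=2048)
print(size, size // 1024)   # bytes, KiB
```

Since the footprint grows linearly in K, doubling K to 4096 at the same herd and tile shape would need 1024 KiB for A alone, which is why the limitation calls for moving the K-L2 loop inside the herd.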
Known follow-ups