[AIE2P] Implement sparse fifo_ld + sparse extract/concat/insert #970
Open
matteius wants to merge 6 commits into
added 2 commits
April 25, 2026 23:05
Followup H — closes 2 of 4 undefined symbols for G-T3.6-003
(state/followup-d/aiecc-link-error-step3.log).
Resolves the link errors for `aie::sparse_vector<int8, 256>::extract_data`
and `aie::sparse_vector_input_buffer_stream<int8, 256>::pop` (the partial
extract path) on AIE2P, by implementing the AIE-API-shaped surface in
pure header code.
Three changes:
1. `aiebase_typedefs.h` (new code under #if __AIEARCH__ == 21)
Define the AIE2P-larger sparse vector types
(v512uint4_sparse, v256uint8_sparse, v128uint16_sparse,
v512int4_sparse, v256int8_sparse, v128int16_sparse) as composite
structs holding `lo` + `hi` halves of the AIEv2-sized (640-bit)
sparse vectors. Mirrors the v128bfp16ebs16 / v128bfp16ebs8 pattern
(lines 563-576). Previously these were empty-stub structs in
aie2p_aie_api_compat.h:53-66 (now removed) which made every
forward-decl that took/returned them unimplementable.
2. `aie2p_upd_ext.h` (new code at tail)
Implement the existing forward-decl surface:
- extract_sparse_data(v128int8_sparse) -> v64int8 etc:
mirror aiev2_upd_ext.h:2602-2622 but use struct field access
(`return v.data;`) instead of __builtin_aiev2_ext_qx (which is
not defined in upstream Peano — supplied by Vitis Chess).
- extract_v* synonyms covering the same family.
- extract_sparsity returning v.mask.
- extract_v128int8_sparse(v256int8_sparse, int) etc:
extracts via lo/hi field access on the new composite types.
- concat / set_v* / insert overloads building larger from smaller.
3. `aie2p_aie_api_compat.h` cleanup
Remove the stub `struct v256int8_sparse {};` family that previously
shadowed the now-real composite types from aiebase_typedefs.h.
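The field-access approach described in the three changes above can be sketched on the host with plain C++ stand-ins. This is only an illustration: `narrow_sparse`, `wide_sparse`, and `extract_half` are hypothetical names, `std::array` replaces the real AIE vector types, and the mapping of index 0 to `lo` is an assumption rather than something this commit states.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-ins; the real types are AIE vector types.
struct narrow_sparse {                   // models a 640-bit sparse vector such as v128int8_sparse
    std::array<std::int8_t, 64> data;    // 512 bits of packed payload
    std::array<std::uint8_t, 16> mask;   // 128-bit sparsity mask
};
static_assert(sizeof(narrow_sparse) * 8 == 640, "512 data bits + 128 mask bits");

struct wide_sparse {                     // models a composite such as v256int8_sparse
    narrow_sparse lo;
    narrow_sparse hi;
};

// Field-access versions of the extract / concat / insert family.
inline std::array<std::int8_t, 64> extract_sparse_data(const narrow_sparse& v) { return v.data; }
inline std::array<std::uint8_t, 16> extract_sparsity(const narrow_sparse& v) { return v.mask; }

// Assumption: idx 0 selects lo, any other idx selects hi.
inline narrow_sparse extract_half(const wide_sparse& v, int idx) { return idx == 0 ? v.lo : v.hi; }

inline wide_sparse concat(const narrow_sparse& lo, const narrow_sparse& hi) { return {lo, hi}; }

// Returns a copy of v with one half replaced; the argument is left untouched.
inline wide_sparse insert(wide_sparse v, int idx, const narrow_sparse& half) {
    (idx == 0 ? v.lo : v.hi) = half;
    return v;
}
```

Because every operation is a plain member read or write, nothing here needs a compiler builtin, which is the property that lets this part of the change stay header-only.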
Verified by recompiling the Followup D microtest's
passthrough_decompress.cc against the modified headers (installed over
the wheel install at $PEANO_INSTALL_DIR). The undefined-symbol set
shrinks from {fifo_ld_pop, fifo_ld_fill, extract_v128int8_sparse,
extract_sparse_data} to {fifo_ld_pop, fifo_ld_fill}. The remaining two
require silicon-load semantics (new clang builtin + LLVM intrinsic +
SelectionDAG patterns mapping to VLDA_POP_640_*) and are out of scope
for header-only work.
No clang or LLVM rebuild required — pure header changes.
[AIE2P] Implement v128int8_sparse fifo_ld_pop L3-L5 backend chain

Adds the missing 640-bit sparse FIFO load intrinsic so that
v128int8_sparse fifo_ld_pop can lower to the existing silicon
ops vlda.pop.640 / vldb.pop.640 (already wired into
VLD_POP_640_normal_pop_pseudo at AIE2PMultiSlotPseudoInstrInfo.td:109).
Followup H closed 2 of 4 microtest sparse symbols
(extract_sparse_data, extract_v128int8_sparse) via header-only fixes.
The remaining 2 (fifo_ld_pop / fifo_ld_fill on v256int8_sparse_unaligned)
required this silicon-load chain. Once the narrow v128int8_sparse case
is in place at L3-L5, the wide v256int8_sparse case is a header-only
composition via Followup H's set_v256int8_sparse + insert overloads.
Layers added (each mirrors the shape of the existing BFP16 multi-output code):
L3 — clang frontend builtin
__builtin_aie2p_fifo_ld_pop_640_unaligned_sparse with signature
"vv*&V32i&i&V64c&V16c&" (void; ptr-ref + fifo-state + pos +
data-out + sparsity-mask-out, all by reference).
clang/include/clang/Basic/BuiltinsAIE2P.def
clang/lib/CodeGen/CGBuiltin.cpp (3 sites: dispatch table +
AIE-style EmitAIEBuiltinExpr + MXStructCount=2 case in the
BFP16-style multi-output handler).
L4 — LLVM IR intrinsic
int_aie2p_fifo_ld_pop_640_unaligned_sparse, returning
[llvm_anyptr_ty, llvm_v32i32_ty, llvm_i32_ty, llvm_v64i8_ty,
llvm_v16i8_ty] from inputs [llvm_anyptr_ty, llvm_v32i32_ty,
llvm_i32_ty]. v16i8 (128 bits) holds the sparsity_t mask.
llvm/include/llvm/IR/IntrinsicsAIE2P.td
L5 — SelectionDAG / GISel lowering
New selector selectVLD_FIFO_POP_640_SPARSE allocates a virtual
mQXsa (640-bit) register, builds VLD_POP_640_normal_pop_pseudo,
then splits the 640-bit dst back into data (sub_sparse_x) +
mask (sub_sparse_q) via the new buildAndConstrainSparseFifoLoadCopies
helper (mirrors buildAndConstrainFifoLoadCopies / sub_bfp16_*).
Also registers the intrinsic as memory-touching in
getTgtMemIntrinsic and as a FIFO-reg user in
isUsedAsFifoRegInIntrinsic + the ValueTracking
getUnderlyingObjectAIEIntrinsic alias-analysis switch.
llvm/lib/Target/AIE/aie2p/AIE2PInstructionSelector.cpp
llvm/lib/Target/AIE/aie2p/AIE2PISelLowering.cpp
llvm/lib/Target/AIE/aie2p/AIE2PRegisterBankInfo.cpp
llvm/lib/Analysis/ValueTracking.cpp
L2 — header macro instantiation
New FIFO_LD_SPARSE macro (mirrors FIFO_LD_BFP16) instantiates
fifo_ld_reset/fill/pop for v128int8_sparse, calling the new
builtin with (v64char&)r.data + (v16char&)r.mask casts.
New FIFO_LD_SPARSE_WIDE macro (mirrors FIFO_LD_BFP16_WIDE)
composes two narrow pops into v256int8_sparse via Followup H's
set_v256int8_sparse + insert overloads. Both registered in the
master FIFO_LD macro.
clang/lib/Headers/aie2p/aie2p_ldst.h
typedef — added v16char (16-byte char vector) to aiebase_typedefs.h
to match the V16c builtin signature for the mask out-ref. Used
only by the new FIFO_LD_SPARSE macro's reinterpret cast on
r.mask (which has storage type sparsity_t = unsigned _BitInt(128),
same 128-bit width).
clang/lib/Headers/aiebase_typedefs.h
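The by-reference builtin shape and the wide composition described in the layers above can be mocked on the host as follows. This is a sketch under stated assumptions, not the hardware behaviour: `mock_fifo_ld_pop_640_sparse`, `fifo_ld_pop_narrow`, and `fifo_ld_pop_wide` are hypothetical names, the mock reads 64 data bytes followed by 16 mask bytes from contiguous memory (the real register layout is not modelled), it ignores the FIFO state and position, and it assumes the first narrow pop fills the `lo` half.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct narrow_sparse {
    std::array<std::int8_t, 64> data;    // 512-bit payload
    std::array<std::uint8_t, 16> mask;   // 128-bit sparsity mask
};
struct wide_sparse { narrow_sparse lo, hi; };

using fifo_state = std::array<std::int32_t, 32>;  // stands in for the V32i operand

// Mirrors the "vv*&V32i&i&V64c&V16c&" shape: void return, five reference parameters.
// Mock semantics only: copy 80 bytes and advance the pointer.
inline void mock_fifo_ld_pop_640_sparse(const std::uint8_t*& ptr, fifo_state& state, int& pos,
                                        std::array<std::int8_t, 64>& data_out,
                                        std::array<std::uint8_t, 16>& mask_out) {
    std::memcpy(data_out.data(), ptr, data_out.size());
    std::memcpy(mask_out.data(), ptr + data_out.size(), mask_out.size());
    ptr += data_out.size() + mask_out.size();
    (void)state;  // not modelled in this mock
    (void)pos;    // not modelled in this mock
}

// Narrow pop: one call, with the two outputs landing in the struct fields.
inline narrow_sparse fifo_ld_pop_narrow(const std::uint8_t*& ptr, fifo_state& state, int& pos) {
    narrow_sparse r{};
    mock_fifo_ld_pop_640_sparse(ptr, state, pos, r.data, r.mask);
    return r;
}

// Wide pop: two narrow pops composed into the lo/hi composite.
inline wide_sparse fifo_ld_pop_wide(const std::uint8_t*& ptr, fifo_state& state, int& pos) {
    wide_sparse r{};
    r.lo = fifo_ld_pop_narrow(ptr, state, pos);
    r.hi = fifo_ld_pop_narrow(ptr, state, pos);
    return r;
}
```

In this mock the wide form issues exactly two narrow pops, which is consistent with the two `vldb.pop.640` instructions reported for the `v256int8_sparse` case.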
Tier 1 validation (this commit):
test_sparse_intrinsic.ll lowers via llc to vldb.pop.640 cleanly.
test_sparse_builtin.cc emits @llvm.aie2p.fifo.ld.pop.640.unaligned.sparse
in clang -emit-llvm output.
test_v128int8_narrow.cc compiles to a single vldb.pop.640 instruction.
Tier 2 validation (this commit):
Followup D microtest passthrough_decompress.cc compiled with
the new toolchain links with ZERO sparse undefined symbols.
Only __muldi3 (libgcc) remains. The .o contains TWO vldb.pop.640
instructions (one for each half of v256int8_sparse).
Symbol count progression on Followup D microtest:
Followup D baseline: {fifo_ld_pop, fifo_ld_fill, extract_v128int8_sparse,
extract_sparse_data, __muldi3} (5)
After Followup H: {fifo_ld_pop, fifo_ld_fill, __muldi3} (3)
After Followup I: {__muldi3} (1)
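The progression above can be checked mechanically as a set difference. The snippet below is a small host-side sketch; the symbol names are copied from the three sets listed, and `closed_by` is a hypothetical helper introduced only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

using symbol_set = std::set<std::string>;

// Undefined-symbol sets copied from the progression above.
const symbol_set baseline = {"fifo_ld_pop", "fifo_ld_fill", "extract_v128int8_sparse",
                             "extract_sparse_data", "__muldi3"};
const symbol_set after_followup_h = {"fifo_ld_pop", "fifo_ld_fill", "__muldi3"};
const symbol_set after_followup_i = {"__muldi3"};

// Symbols present in `before` but absent from `after`, i.e. resolved by that step.
inline symbol_set closed_by(const symbol_set& before, const symbol_set& after) {
    symbol_set out;
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::inserter(out, out.begin()));
    return out;
}
```

Applied to consecutive sets, `closed_by` yields the two extract symbols for Followup H and the two `fifo_ld_*` symbols for Followup I.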
Tier 4 silicon validation: NOT YET RUN.
Collaborator
hi @matteius, thanks a lot for this contribution! I didn't go through all details yet, but one observation upfront: could you please add tests for both the frontend lowering to IR (should go to |
Summary
Two commits filling in the AIE2P sparse-vector toolchain gap:

1. [AIE2P][Headers] Implement sparse extract/concat/insert via field access (header-only).
- Defines the AIE2P-larger sparse vector types (`v512uint4_sparse`, `v256uint8_sparse`, `v128uint16_sparse`, `v512int4_sparse`, `v256int8_sparse`, `v128int16_sparse`) as composite structs holding `lo` + `hi` halves of the AIEv2-sized (640-bit) sparse vectors. Mirrors the `v128bfp16ebs*` pattern at `aiebase_typedefs.h:563-576`.
- Implements the existing `extract_sparse_data` / `extract_v*` / `concat` / `set_v*` / `insert` forward-declared surface using struct field access (`return v.data;`) instead of `__builtin_aiev2_ext_qx` (which is supplied by Vitis Chess but not by upstream Peano).
- Removes the conflicting empty-stub `struct v256int8_sparse {};` family from `aie2p_aie_api_compat.h`.
- No clang/LLVM rebuild required.

2. [AIE2P] Implement v128int8_sparse fifo_ld_pop L3-L5 backend chain (full backend addition).
Adds the missing 640-bit sparse FIFO load intrinsic so `v128int8_sparse` `fifo_ld_pop` lowers to the existing silicon ops `vlda.pop.640` / `vldb.pop.640` (already wired into `VLD_POP_640_normal_pop_pseudo` at `AIE2PMultiSlotPseudoInstrInfo.td:109`).
Layers added (mirrors the BFP16 multi-output shape):
- `__builtin_aie2p_fifo_ld_pop_640_unaligned_sparse` with sig `vv*&V32i&i&V64c&V16c&` (ptr-ref, fifo-state, pos, data-out, sparsity-mask-out).
- `int_aie2p_fifo_ld_pop_640_unaligned_sparse` returning the 5-tuple `[anyptr, v32i32, i32, v64i8, v16i8]`.
- `selectVLD_FIFO_POP_640_SPARSE`: allocates a virtual `mQXsa` (640-bit) reg, builds `VLD_POP_640_normal_pop_pseudo`, splits the 640-bit dst back into data (`sub_sparse_x`) + mask (`sub_sparse_q`) via a new `buildAndConstrainSparseFifoLoadCopies` helper (mirrors `buildAndConstrainFifoLoadCopies` / `sub_bfp16_*`). Registered as memory-touching in `getTgtMemIntrinsic` and as a FIFO-reg user in `isUsedAsFifoRegInIntrinsic` + `getUnderlyingObjectAIEIntrinsic`.
- `FIFO_LD_SPARSE` / `FIFO_LD_SPARSE_WIDE` macros in `aie2p_ldst.h` (mirrors `FIFO_LD_BFP16*`); the wide form composes two narrow pops into `v256int8_sparse` via the composite-struct helpers from commit 1.
- `v16char` (16-byte char vector) added to `aiebase_typedefs.h` to match the `V16c` builtin signature for the mask out-ref.

Net diff: 12 files, 468 insertions(+), 9 deletions(-).
Motivation
Compiling AIE2P kernels that consume `aie::sparse_vector_input_buffer_stream<int8, 256>::pop` against upstream Peano produced four undefined symbols. Two were closable with header-only fixes (commit 1). The remaining two (`fifo_ld_pop`, `fifo_ld_fill` on `v256int8_sparse_unaligned`) require silicon-load semantics, i.e. the clang builtin → LLVM intrinsic → SelectionDAG patterns mapping to the existing `VLDA_POP_640_*` silicon ops (commit 2).

Symbol-count progression on the same microtest:
- `{fifo_ld_pop, fifo_ld_fill, extract_v128int8_sparse, extract_sparse_data, __muldi3}` (5)
- `{fifo_ld_pop, fifo_ld_fill, __muldi3}` (3)
- `{__muldi3}` (1, libgcc)

Test plan
- `test_sparse_intrinsic.ll` lowers via `llc` to `vldb.pop.640` cleanly; `test_sparse_builtin.cc` emits `@llvm.aie2p.fifo.ld.pop.640.unaligned.sparse` in `clang -emit-llvm` output; `test_v128int8_narrow.cc` compiles to a single `vldb.pop.640` instruction.
- `passthrough_decompress.cc` compiled with the new toolchain links with zero sparse undefined symbols; the `.o` contains two `vldb.pop.640` instructions (one per half of `v256int8_sparse`).

Marked draft because of the silicon gap. Happy to split commits, restructure, or change naming if the `vldb` vs `vlda` convention or the `sub_sparse_x` / `sub_sparse_q` sub-register names diverge from upstream preferences.