
[AIE2P] Implement sparse fifo_ld + sparse extract/concat/insert#970

Open
matteius wants to merge 6 commits into Xilinx:aie-public from opensensor:feature/aie2p-sparse-fifo-ld-intrinsics

Conversation

@matteius

Summary

Two commits filling in the AIE2P sparse-vector toolchain gap:

  1. [AIE2P][Headers] Implement sparse extract/concat/insert via field access — header-only.
    Defines the AIE2P-larger sparse vector types (v512uint4_sparse, v256uint8_sparse, v128uint16_sparse, v512int4_sparse, v256int8_sparse, v128int16_sparse) as composite structs holding lo + hi halves of the AIEv2-sized (640-bit) sparse vectors. Mirrors the v128bfp16ebs* pattern at aiebase_typedefs.h:563-576.
    Implements the existing extract_sparse_data / extract_v* / concat / set_v* / insert forward-declared surface using struct field access (return v.data;) instead of __builtin_aiev2_ext_qx (which is supplied by Vitis Chess but not by upstream Peano).
    Removes the conflicting empty-stub struct v256int8_sparse {}; family from aie2p_aie_api_compat.h.
    No clang/LLVM rebuild required.

  2. [AIE2P] Implement v128int8_sparse fifo_ld_pop L3-L5 backend chain — full backend addition.
    Adds the missing 640-bit sparse FIFO load intrinsic so v128int8_sparse fifo_ld_pop lowers to the existing silicon ops vlda.pop.640 / vldb.pop.640 (already wired into VLD_POP_640_normal_pop_pseudo at AIE2PMultiSlotPseudoInstrInfo.td:109).
    Layers added (mirrors the BFP16 multi-output shape):

    • L3 — clang builtin __builtin_aie2p_fifo_ld_pop_640_unaligned_sparse with sig vv*&V32i&i&V64c&V16c& (ptr-ref, fifo-state, pos, data-out, sparsity-mask-out).
    • L4 — LLVM IR intrinsic int_aie2p_fifo_ld_pop_640_unaligned_sparse returning the 5-tuple [anyptr, v32i32, i32, v64i8, v16i8].
    • L5 — selectVLD_FIFO_POP_640_SPARSE: allocates a virtual mQXsa (640-bit) reg, builds VLD_POP_640_normal_pop_pseudo, splits the 640-bit dst back into data (sub_sparse_x) + mask (sub_sparse_q) via a new buildAndConstrainSparseFifoLoadCopies helper (mirrors buildAndConstrainFifoLoadCopies / sub_bfp16_*). Registered as memory-touching in getTgtMemIntrinsic and as a FIFO-reg user in isUsedAsFifoRegInIntrinsic + getUnderlyingObjectAIEIntrinsic.
    • L2 — FIFO_LD_SPARSE / FIFO_LD_SPARSE_WIDE macros in aie2p_ldst.h (mirrors FIFO_LD_BFP16*); the wide form composes two narrow pops into v256int8_sparse via the composite-struct helpers from commit 1.
    • typedef — adds v16char (16-byte char vector) to aiebase_typedefs.h to match the V16c builtin signature for the mask out-ref.
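
As a rough host-side illustration of the data flow described above (not the actual lowering), the sketch below models one 640-bit pop as an 80-byte copy that is split into a 64-byte data part and a 16-byte mask part, and builds the wide value from two narrow pops. All names here (`narrow_sparse`, `wide_sparse`, `model_pop_narrow`, `model_pop_wide`) are hypothetical, the byte order of data versus mask inside the 640-bit value is an assumption, and the real intrinsic's FIFO state and position operands are left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sizes taken from the builtin signature: V64c data-out, V16c mask-out.
constexpr std::size_t kDataBytes = 64;                      // 512-bit data
constexpr std::size_t kMaskBytes = 16;                      // 128-bit mask
constexpr std::size_t kPopBytes = kDataBytes + kMaskBytes;  // one pop
static_assert(kPopBytes * 8 == 640, "one pop moves 640 bits");

// Hypothetical stand-in for v128int8_sparse.
struct narrow_sparse {
    std::array<std::int8_t, kDataBytes> data;
    std::array<std::uint8_t, kMaskBytes> mask;
};

// Hypothetical stand-in for the composite v256int8_sparse (lo + hi halves).
struct wide_sparse {
    narrow_sparse lo;
    narrow_sparse hi;
};

// Models a single narrow pop: read 80 bytes, split them, advance the pointer.
// ASSUMPTION: data occupies the first 64 bytes and the mask the last 16.
inline narrow_sparse model_pop_narrow(const std::uint8_t*& p) {
    assert(p != nullptr);
    narrow_sparse r;
    std::memcpy(r.data.data(), p, kDataBytes);
    std::memcpy(r.mask.data(), p + kDataBytes, kMaskBytes);
    p += kPopBytes;
    return r;
}

// Models the wide form: two consecutive narrow pops, one per half.
inline wide_sparse model_pop_wide(const std::uint8_t*& p) {
    wide_sparse w;
    w.lo = model_pop_narrow(p);
    w.hi = model_pop_narrow(p);
    return w;
}
```

In this model a wide pop consumes 160 bytes, which lines up with the two vldb.pop.640 instructions reported for the v256int8_sparse case in the test plan.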

Net diff: 12 files, 468 insertions(+), 9 deletions(-).

Motivation

Compiling AIE2P kernels that consume aie::sparse_vector_input_buffer_stream<int8, 256>::pop against upstream Peano produced four undefined sparse-related symbols (plus __muldi3 from libgcc). Two were closable with header-only fixes (commit 1). The remaining two (fifo_ld_pop, fifo_ld_fill on v256int8_sparse_unaligned) require silicon-load semantics, i.e. the clang builtin → LLVM intrinsic → SelectionDAG patterns mapping to the existing VLDA_POP_640_* silicon ops (commit 2).

Symbol-count progression on the same microtest:

  • baseline: {fifo_ld_pop, fifo_ld_fill, extract_v128int8_sparse, extract_sparse_data, __muldi3} (5)
  • after commit 1: {fifo_ld_pop, fifo_ld_fill, __muldi3} (3)
  • after commit 2: {__muldi3} (1, libgcc)

Test plan

  • Tier 1: test_sparse_intrinsic.ll lowers via llc to vldb.pop.640 cleanly; test_sparse_builtin.cc emits @llvm.aie2p.fifo.ld.pop.640.unaligned.sparse in clang -emit-llvm output; test_v128int8_narrow.cc compiles to a single vldb.pop.640 instruction.
  • Tier 2: existing microtest passthrough_decompress.cc compiled with the new toolchain links with zero sparse undefined symbols; .o contains two vldb.pop.640 instructions (one per half of v256int8_sparse).
  • Tier 4 (silicon): not yet run — opening as draft for design review while we get hardware time. Will promote out of draft once a Strix dev kit confirms the lowered pop produces correct data + mask values.

Marked draft because of the silicon gap. Happy to split commits, restructure, or change naming if vldb vs vlda convention or the sub_sparse_x / sub_sparse_q sub-register names diverge from upstream preferences.

Matt Davis (Followup H) added 2 commits April 25, 2026 23:05
Followup H — closes 2 of 4 undefined symbols for G-T3.6-003
(state/followup-d/aiecc-link-error-step3.log).

Resolves the link errors for `aie::sparse_vector<int8, 256>::extract_data`
and `aie::sparse_vector_input_buffer_stream<int8, 256>::pop` (the partial
extract path) on AIE2P, by implementing the AIE-API-shaped surface in
pure header code.

Three changes:

1. `aiebase_typedefs.h` (new code under #if __AIEARCH__ == 21)
   Define the AIE2P-larger sparse vector types
   (v512uint4_sparse, v256uint8_sparse, v128uint16_sparse,
   v512int4_sparse, v256int8_sparse, v128int16_sparse) as composite
   structs holding `lo` + `hi` halves of the AIEv2-sized (640-bit)
   sparse vectors. Mirrors the v128bfp16ebs16 / v128bfp16ebs8 pattern
   (lines 563-576). Previously these were empty-stub structs in
   aie2p_aie_api_compat.h:53-66 (now removed) which made every
   forward-decl that took/returned them unimplementable.

2. `aie2p_upd_ext.h` (new code at tail)
   Implement the existing forward-decl surface:
   - extract_sparse_data(v128int8_sparse) -> v64int8 etc:
     mirror aiev2_upd_ext.h:2602-2622 but use struct field access
     (`return v.data;`) instead of __builtin_aiev2_ext_qx (which is
     not defined in upstream Peano — supplied by Vitis Chess).
   - extract_v* synonyms covering the same family.
   - extract_sparsity returning v.mask.
   - extract_v128int8_sparse(v256int8_sparse, int) etc:
     extracts via lo/hi field access on the new composite types.
   - concat / set_v* / insert overloads building larger from smaller.

3. `aie2p_aie_api_compat.h` cleanup
   Remove the stub `struct v256int8_sparse {};` family that previously
   shadowed the now-real composite types from aiebase_typedefs.h.
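
For readers unfamiliar with the field-access approach, here is a minimal portable sketch of the idea, assuming a narrow sparse value is just a data field plus a mask field and a wide value is a `lo`/`hi` pair. The `mock_` types and functions are hypothetical stand-ins written for illustration; they are not the definitions in aiebase_typedefs.h or aie2p_upd_ext.h.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for v128int8_sparse: 64 data bytes + 128-bit mask.
struct mock_v128int8_sparse {
    std::array<std::int8_t, 64> data;
    std::array<std::uint8_t, 16> mask;
};

// Hypothetical stand-in for the composite v256int8_sparse.
struct mock_v256int8_sparse {
    mock_v128int8_sparse lo;
    mock_v128int8_sparse hi;
};

// Extraction is plain field access, with no builtin involved.
inline std::array<std::int8_t, 64>
mock_extract_sparse_data(const mock_v128int8_sparse& v) {
    return v.data;
}

inline std::array<std::uint8_t, 16>
mock_extract_sparsity(const mock_v128int8_sparse& v) {
    return v.mask;
}

// Selects the lo (idx == 0) or hi (idx == 1) half of the wide value.
inline mock_v128int8_sparse
mock_extract_half(const mock_v256int8_sparse& v, int idx) {
    assert(idx == 0 || idx == 1);
    return idx == 0 ? v.lo : v.hi;
}

// Builds the wide value from two narrow ones.
inline mock_v256int8_sparse
mock_concat(const mock_v128int8_sparse& lo, const mock_v128int8_sparse& hi) {
    return {lo, hi};
}

// Returns a copy of v with one half replaced.
inline mock_v256int8_sparse
mock_insert(mock_v256int8_sparse v, int idx, const mock_v128int8_sparse& half) {
    assert(idx == 0 || idx == 1);
    (idx == 0 ? v.lo : v.hi) = half;
    return v;
}
```

Because every operation reduces to reading or writing a struct member, nothing here depends on a compiler builtin, which is the property the header-only change relies on.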

Verified by recompiling the Followup D microtest's
passthrough_decompress.cc against the modified headers (installed over
the wheel install at $PEANO_INSTALL_DIR). The undefined-symbol set
shrinks from {fifo_ld_pop, fifo_ld_fill, extract_v128int8_sparse,
extract_sparse_data} to {fifo_ld_pop, fifo_ld_fill}. The remaining two
require silicon-load semantics (new clang builtin + LLVM intrinsic +
SelectionDAG patterns mapping to VLDA_POP_640_*) and are out of scope
for header-only work.

No clang or LLVM rebuild required — pure header changes.

Adds the missing 640-bit sparse FIFO load intrinsic so that
v128int8_sparse fifo_ld_pop can lower to the existing silicon
ops vlda.pop.640 / vldb.pop.640 (already wired into
VLD_POP_640_normal_pop_pseudo at AIE2PMultiSlotPseudoInstrInfo.td:109).

Followup H closed 2 of 4 microtest sparse symbols
(extract_sparse_data, extract_v128int8_sparse) via header-only fixes.
The remaining 2 (fifo_ld_pop / fifo_ld_fill on v256int8_sparse_unaligned)
required this silicon-load chain. Once the narrow v128int8_sparse case
is in place at L3-L5, the wide v256int8_sparse case is a header-only
composition via Followup H's set_v256int8_sparse + insert overloads.

Layers added (mirrors the BFP16 multi-output shape):

 L3 — clang frontend builtin
      __builtin_aie2p_fifo_ld_pop_640_unaligned_sparse with signature
      "vv*&V32i&i&V64c&V16c&" (void; ptr-ref + fifo-state + pos +
      data-out + sparsity-mask-out, all by reference).
        clang/include/clang/Basic/BuiltinsAIE2P.def
        clang/lib/CodeGen/CGBuiltin.cpp (3 sites: dispatch table +
          AIE-style EmitAIEBuiltinExpr + MXStructCount=2 case in the
          BFP16-style multi-output handler).

 L4 — LLVM IR intrinsic
      int_aie2p_fifo_ld_pop_640_unaligned_sparse, returning
      [llvm_anyptr_ty, llvm_v32i32_ty, llvm_i32_ty, llvm_v64i8_ty,
       llvm_v16i8_ty] from inputs [llvm_anyptr_ty, llvm_v32i32_ty,
       llvm_i32_ty]. v16i8 (128 bits) holds the sparsity_t mask.
        llvm/include/llvm/IR/IntrinsicsAIE2P.td

 L5 — SelectionDAG / GISel lowering
      New selector selectVLD_FIFO_POP_640_SPARSE allocates a virtual
      mQXsa (640-bit) register, builds VLD_POP_640_normal_pop_pseudo,
      then splits the 640-bit dst back into data (sub_sparse_x) +
      mask (sub_sparse_q) via the new buildAndConstrainSparseFifoLoadCopies
      helper (mirrors buildAndConstrainFifoLoadCopies / sub_bfp16_*).
      Also registers the intrinsic as memory-touching in
      getTgtMemIntrinsic and as a FIFO-reg user in
      isUsedAsFifoRegInIntrinsic + the ValueTracking
      getUnderlyingObjectAIEIntrinsic alias-analysis switch.
        llvm/lib/Target/AIE/aie2p/AIE2PInstructionSelector.cpp
        llvm/lib/Target/AIE/aie2p/AIE2PISelLowering.cpp
        llvm/lib/Target/AIE/aie2p/AIE2PRegisterBankInfo.cpp
        llvm/lib/Analysis/ValueTracking.cpp

 L2 — header macro instantiation
      New FIFO_LD_SPARSE macro (mirrors FIFO_LD_BFP16) instantiates
      fifo_ld_reset/fill/pop for v128int8_sparse, calling the new
      builtin with (v64char&)r.data + (v16char&)r.mask casts.
      New FIFO_LD_SPARSE_WIDE macro (mirrors FIFO_LD_BFP16_WIDE)
      composes two narrow pops into v256int8_sparse via Followup H's
      set_v256int8_sparse + insert overloads. Both registered in the
      master FIFO_LD macro.
        clang/lib/Headers/aie2p/aie2p_ldst.h

 typedef — added v16char (16-byte char vector) to aiebase_typedefs.h
      to match the V16c builtin signature for the mask out-ref. Used
      only by the new FIFO_LD_SPARSE macro's reinterpret cast on
      r.mask (which has storage type sparsity_t = unsigned _BitInt(128),
      same 128-bit width).
        clang/lib/Headers/aiebase_typedefs.h
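
The mask cast only makes sense because both views are exactly 128 bits wide. A small portable sketch of that width check and of a byte-wise reinterpretation follows; `mask128` and `bytes16` are hypothetical stand-ins for sparsity_t and v16char, and `std::memcpy` is used here in place of the reference cast from the macro so that the example stays well defined on any host compiler.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for sparsity_t (a 128-bit mask).
struct mask128 {
    std::uint64_t lo;
    std::uint64_t hi;
};

// Hypothetical stand-in for v16char (16 one-byte lanes).
using bytes16 = std::array<std::uint8_t, 16>;

// The reinterpretation is only valid if both views have the same size.
static_assert(sizeof(mask128) == sizeof(bytes16),
              "mask and byte-vector views must both be 128 bits");

// View the mask as 16 bytes (host byte order).
inline bytes16 mask_to_bytes(const mask128& m) {
    bytes16 b;
    std::memcpy(b.data(), &m, sizeof b);
    return b;
}

// Rebuild the mask from its 16-byte view.
inline mask128 bytes_to_mask(const bytes16& b) {
    mask128 m;
    std::memcpy(&m, b.data(), sizeof m);
    return m;
}
```

A round trip through the byte view returns the original mask bit for bit, which is all the out-reference in the builtin signature needs.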

Tier 1 validation (this commit):
  test_sparse_intrinsic.ll lowers via llc to vldb.pop.640 cleanly.
  test_sparse_builtin.cc emits @llvm.aie2p.fifo.ld.pop.640.unaligned.sparse
    in clang -emit-llvm output.
  test_v128int8_narrow.cc compiles to a single vldb.pop.640 instruction.

Tier 2 validation (this commit):
  Followup D microtest passthrough_decompress.cc compiled with
  the new toolchain links with ZERO sparse undefined symbols.
  Only __muldi3 (libgcc) remains. The .o contains TWO vldb.pop.640
  instructions (one for each half of v256int8_sparse).

Symbol count progression on Followup D microtest:
  Followup D baseline: {fifo_ld_pop, fifo_ld_fill, extract_v128int8_sparse,
                        extract_sparse_data, __muldi3}                  (5)
  After Followup H:    {fifo_ld_pop, fifo_ld_fill, __muldi3}            (3)
  After Followup I:    {__muldi3}                                       (1)

Tier 4 silicon validation: NOT YET RUN.
@konstantinschwarz
Collaborator

Hi @matteius, thanks a lot for this contribution!

I haven't gone through all the details yet, but one observation upfront: could you please add tests for both the frontend lowering to IR (should go to clang/test/CodeGen/aie/aie2p) and instruction selection (in llvm/test/CodeGen/AIE/aie2p/GlobalISel)?
