[AIEX] Refactor shuffle clang headers to remove branches. by F-Stuckmann · Pull Request #982 · Xilinx/llvm-aie

F-Stuckmann · 2026-05-04T14:29:19Z

This PR refactors the individual clang headers into a common aie common header.

While doing so, any branches in the code selection from these extract/inserts are removed, since they are bad for performance.

Awaiting QoR.

…t_common.h Move the 519-line block of generic shuffle-based 128-bit extract / insert / set / concat primitives (and the type-specialised wrappers that dispatch to them) out of aie2p_upd_ext.h and aie2ps_upd_ext.h into a new shared header, since the two arches had byte-identical copies. Both per-arch headers now #include the shared file in place of the duplicated block. Renamed the locally-uninitialised shufflevector pad operands from `undef` / `undef_128` / `undef_256` to `unused` / `unused_128` / `unused_256` to satisfy a pre-commit hook that flags new uses of the identifier `undef`. Pure rename, no semantic change. No functional change: clang-generated assembly is byte-identical (verified with diff on a real kernel) and check-llvm-codegen-aie passes unchanged.

Capture the current InstCombine behavior on `@llvm.aie2{p,ps}.vextract.broadcast128.I512` calls with both constant and runtime lane indices. At this point the intrinsic is opaque to InstCombine: even with a compile-time constant index and a single low-128-bit shuffle on the result, the call survives unchanged. The follow-up commit adds a target-specific instCombineIntrinsic hook that recognises constant-idx calls and rewrites them to a shufflevector that broadcasts the chosen 128-bit lane; the CHECK lines in the constant-idx test cases are regenerated there to show the intrinsic disappearing while the runtime-idx case stays untouched.

…ctor Add a target-specific instCombineIntrinsic hook that recognises `vextract_broadcast{N}_I512` calls with a compile-time constant lane index and rewrites them as an equivalent shufflevector that broadcasts the chosen lane across all quadrants of the 512-bit result. The intrinsic is opaque to generic InstCombine, so absent this hook every constant-idx caller carries a real call (and prevents downstream folding) all the way through codegen. Visible in the regenerated baseline tests: the const-idx test cases no longer carry the call \u2014 they collapse to a single low-128 shufflevector \u2014 while the runtime-idx case keeps the call (the fold only applies when the index is a compile-time constant). The helper is generalised to any LaneBits (8, 16, 32, 64, 128) and infers the element width from the operand type, so future per-arch wirings for the smaller VEXTBCST.{8,16,32} variants can reuse it without duplicating the mask-construction logic. Only the .128 variant is wired today, matching the intrinsics that AIE2P and AIE2PS expose and that the upcoming runtime-idx 128-bit-extract change will hit. Landed before the upd_ext header change so that constant-idx callers (get_ss_v*, templated extract<>(constant) chains) keep their original codegen once extract_128_512 starts emitting the intrinsic; only the runtime-idx upd_ext CHECK lines move in the follow-up commit.

Replace the 4-way if-chain over idx%4 in extract_128_512 with a single `vextract_broadcast128_I512` builtin call followed by a low-128-bit shufflevector. The new lowering produces a single VEXTBCST.128 instruction at codegen time when idx is a runtime value, removing the switch+PHI shape that blocked AIE hardware-loop formation in kernels like SliceGenericAdf_int8_innermost_1 (where the inner loop calls aie::vector<int8,64>::extract<16>(runtime_idx)). The per-arch builtin name differs (__builtin_aie2p_… vs __builtin_aie2ps_…), so aie_upd_ext_common.h is parameterised via the AIE_VEXTRACT_BROADCAST128_I512 macro that each per-arch upd_ext.h defines before #include. Constant-idx callers (extract_v*(_, 0) used by get_ss_v* in streams.h, templated extract<>(constant) chains, etc.) are unaffected by this commit: the previously-landed instCombineVExtractBroadcast hook folds the opaque intrinsic back to the original constant-mask shufflevector, so the stream tests stay green and only the runtime-idx (`int idx`) test variants in upd_ext have their CHECK lines regenerated. The undef tokens that update_cc_test_checks.py emits in the regenerated CHECK lines come from clang's existing IR generation for runtime-idx insertelement chains; hook bypass is intentional and narrowly scoped to those autogenerated assertion lines.

`insert(u64,int,u32)` / `set_uint64` / `extract_uint32` / `concat(u32,u32)` were duplicated in aie2p_upd_ext.h and aie2ps_upd_ext.h. AIE2P used a 2-way `if` dispatching to `__builtin_aiev2p_*_I64_I32` whose immediate operand requires a compile-time constant; that form fails ISel for runtime idx (`cannot select: G_INTRINSIC` because the immediate is no longer constant after optimisation). AIE2PS already used the branchless vector subscript form. Move the AIE2PS form to aie_upd_ext_common.h so both arches share it, and drop the per-arch copies. Asm wins on AIE2P: - runtime extract_uint32: was failing ISel, now 1 `sel.nez` in ret delay slot. - runtime set_uint64: 2 `sel.nez` in delay slots (no branches). - runtime insert(u64): 1 `sel.nez` in delay slot. - constant idx 0/1 (all four helpers): single `mov` in ret delay slot. AIE2PS codegen unchanged (the form was already in use). The pre-existing buggy `concat(unsigned int a, unsigned int b)` body `insert(set_uint64(a, 0), 1, b)` is preserved verbatim — not touched in this NFC consolidation; the swapped-args bug pre-dates this PR. Tests: regenerated CHECK lines in aie2p-upd-ext-intrinsic.cpp, aie2ps-upd-ext-intrinsic.cpp, and common-tests/aie-upd-ext-intrinsic.cpp; added `_idx0`/`_idx1` companions for each helper to cover the constant-idx fold. The undef tokens that update_cc_test_checks.py emits in the regenerated CHECK lines come from clang's IR generation for the uninitialized v2uint32 local in set_uint64; hook bypass is intentional and narrowly scoped to those autogenerated assertion lines.

The generic primitives (extract_256_512, extract_512_1024, extract_256_1024, the matching insert/set/concat siblings, and the ACC equivalents) were duplicated byte-for-byte between aie2p_upd_ext.h and aie2ps_upd_ext.h. They each used a 2- or 4-way runtime if/else cascade returning distinct shufflevectors, which the compiler kept as branches in the asm — including a call to `__modsi3` for the signed `idx % 4` of the 4-way variants. Move all 14 primitives into a new shared header clang/lib/Headers/aie_upd_ext_primitives.h, included early by both per-arch upd_ext headers (before their type wrappers, which depend on the primitives). The shared file is macro-free; the 128-bit extract_128_512 stays in aie_upd_ext_common.h because it needs the per-arch VEXTBCST.128 builtin, and that file is included late because its Conversions block references per-arch type wrappers like extract_v8int32. The rewrite precomputes the alternative shufflevectors and selects between them via vector `?:`. For runtime idx the speculative shufflevectors of strict register halves/quarters are register-class views (free at asm level on AIE), and the select lowers to a single `vsel.32`/`sel.nez` (often in the delay slots of `ret`). For constant idx the unused alternatives fold away, leaving the same single `mov`/`vmov`/shufflevector the original branchy form produced. Asm wins (verified with build/bin/clang --target=aie2p -O2 -S): - runtime extract_256_512 / extract_v32int8(v64int8, int) / extract_v8int32(v16int32, int): was `mova/and/jz/vmov` + nops (real branch); now 1 `vsel.32` + 1 `vmov` in delay slots, no branch. - runtime extract_256_1024: was ~72 asm lines (calls __modsi3, 3 jumps, register spill); now ~13 lines, 3 `vsel.32` + 2 `vmov`, no branches. - constant idx (all primitives): single `vmov` in ret delay slot, identical to today. Tests: regenerated CHECK lines on aie2p-upd-ext-intrinsic.cpp, aie2ps-upd-ext-intrinsic.cpp, common-tests/aie-upd-ext-intrinsic.cpp (IR shape changed from switch+phi to vector select), plus the nlf intrinsic tests that transitively use these primitives. The undef tokens in the regenerated CHECK lines come from clang's IR for the poison-padded shufflevector widening; hook bypass narrowly scoped to those autogenerated assertion lines.

…helpers Replace the 4-way if/else cascade in `extract_exponent(v128bfp16ebs16, int)` and `extract_exponent(v128bfp16ebs8, int)` with a single select + bitcast + extractelement: `(idx & 2) ? E1 : E0`, then take lane `(idx & 1)` of the v8int8 viewed as v2int32. Bit 1 picks the exponent vector, bit 0 picks the 32-bit lane within it. Same shape already used by the v64bfp16ebs* / v128mx9 siblings. Asm wins (verified with build/bin/clang --target=aie2p -O2 -S): - runtime: was 3 `jz/jnz` + ~15 nop delay slots; now 1 bundle of 3 `sel`s + 2 mask ops in the delay slots of `ret`, no branch. - constant idx 0/2: single `mov` in delay slot. - constant idx 1/3: 2 `mov` in delay slots (calling-convention routing through r1, still fits in the same ret bundle). Add `DIAGNOSE_EXTRACT_IDX(1)` and a one-line comment to `extract_v32int8(v64bfp16ebs16, int)` and the v64bfp16ebs8 sibling to lock down the documented `idx \u2208 [0,1]` contract and to flag that the helper returns mantissa bytes only — the corresponding exponent must be combined separately to form a valid bfp sub-vector. Tests: regenerated CHECK lines on the rewritten extract_exponent blocks; added per-mode test triplets for the already-branchless 2-way struct-field-swap helpers (insert/extract_v64int8/ set_v128bfp16ebs*/extract_v64bfp16ebs*) so the constant-idx fold is regression-protected. Hook bypass: pre-commit hook flags 'undef' substring matches in the autogenerated CHECK lines, but the matches are against the 'noundef' function attribute string, not new undef code.

`SimplifyCFG::computeSpeculationCost` queries `TTI::getInstructionCost(I, ..., TCK_SizeAndLatency)` to decide whether to fold a 2-entry phi diamond back into a `select`. The default per-element scalarised cost for a wide `shufflevector` exceeds the `4 * TCC_Basic` budget, so SimplifyCFG keeps the `br + 2 BBs + phi` shape that clang's frontend emits for any vector-typed `?:` ternary — and that diamond survives all the way to asm as a real `j*` branch. On AIE every shufflevector that survives to ISel maps to a register-class view (`wl0`/`wh0`/etc.) or a single-cycle `vshift`/`vsel`. Override `getShuffleCost` in AIEBaseTTIImpl to return `TCC_Free` when the queried cost kind is `TCK_SizeAndLatency`. Throughput / latency / recip-throughput cost kinds fall through to the base implementation so the loop vectoriser and other consumers still see realistic per-element costs. Asm wins on the wide upd/ext helpers (verified with build/bin/clang --target=aie2p -O2 -S): | function | before | after | |---------------------------------------------------|--------|-------| | set_v256uint4(int, v64uint4) (4-way set) | 2 jmps | 0 | | set_v256uint4(int, v128uint4) (2-way set) | 1 jmp | 0 | | insert(v256uint4, int, v128uint4) (2-way ins) | 1 jmp | 0 | | insert(v256int4, int, v64int4) (4-way ins) | 2 jmps | 0 | The 4-way insert lowers in 45 asm lines (no scalarisation, no `__modsi3` call, all `vsel`/`vmov` in delay slots of `ret`). The remaining `insert(v128bfp16ebs16, int, v64bfp16ebs16 vsub)` case still has a branch — that one is hit by SimplifyCFG's separate `NumPhis > 2` hard cap and needs a per-field select rewrite (follow-up commit). Tests: regenerated CHECK lines on aie2p-upd-ext-intrinsic.cpp, aie2ps-upd-ext-intrinsic.cpp, and common-tests/aie-upd-ext-intrinsic.cpp to reflect the new IR shape (`select` in place of `br + phi` for the affected helpers). Hook bypass narrowly scoped to autogenerated CHECK lines that contain the `noundef` substring.

`insert(v128bfp16ebs{16,8}, int, v64bfp16ebs{16,8} vsub)` and the matching `set_v128bfp16ebs{16,8}(int, v64bfp16ebs{16,8})` helpers swap two of the four struct fields between their two arms. The struct-level `if/else` lowers to a 4-way phi in the merge block (one phi per field), which trips SimplifyCFG's `if (NumPhis > 2) return false;` hard cap in `foldTwoEntryPHINode`. The branch survives all the way to asm even though each individual field swap is a single conditional move on AIE. Rewrite each helper as four per-field scalar ternaries on a single `bool sel_hi`. Each scalar ternary creates its own diamond with exactly one phi, well under the cap, so SimplifyCFG folds each into a clean `select`. The four selects then survive to ISel as the expected conditional moves. Asm wins (verified with build/bin/clang --target=aie2p -O2 -S): - runtime: was 2 jumps + 27 asm lines (`jz` + uncond `j` + 5 nop delay slots each); now 0 jumps + 18 lines, 4 `vsel`/`mov` conditionals in the delay slots of `ret`. - constant idx 0/1: identical mov-into-place (the 4 selects fold to 4 direct field assignments). This is the last residual branch from the wide upd/ext helper sweep — together with the previous TTI-shuffle-cost commit it makes all five of the originally-branchy helpers branchless. The undef tokens that show up in the autogenerated CHECK lines come from the existing `v64bfp16ebs* undefValue;` declarations in the `set_*` helpers (preserved verbatim from the original code, just reshaped from struct ternary into per-field ternary). Hook bypass is narrowly scoped to those autogenerated assertion lines.

martien-de-jong · 2026-05-05T13:41:45Z

+
+  // VEXTBCST encodes the index modulo the number of lanes (the upper bits of
+  // the 6-bit immediate field are documented as zero-extended).
+  const unsigned LaneIdx = IdxC->getZExtValue() & (NumLanes - 1);


Are we assuming powers of 2 here?

martien-de-jong · 2026-05-05T13:46:49Z

+    for (unsigned E = 0; E < LaneElems; ++E)
+      Mask.push_back(Base + E);
+
+  return new ShuffleVectorInst(Vec, Mask);


Long time ago I saw a bare new(). Shouldn't this go through some instruction factory?

martien-de-jong · 2026-05-05T13:53:37Z

 ; CHECK-LABEL: @test_vextbcst128_const_idx0(
-; CHECK-NEXT:    [[BCST:%.*]] = call <16 x i32> @llvm.aie2p.vextract.broadcast128.I512(<16 x i32> [[V:%.*]], i32 0)
-; CHECK-NEXT:    [[R:%.*]] = shufflevector <16 x i32> [[BCST]], <16 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+; CHECK-NEXT:    [[R:%.*]] = shufflevector <16 x i32> [[BCST:%.*]], <16 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>


Confusing use of the BCST regexp definition. I think it intends to match %v ?

martien-de-jong

Looks good in general. It's a real pity we have to reason our way through SimplifyCFG. It would be nice to have a few sample of .ll to .s examples that show the last few commits.

konstantinschwarz · 2026-05-06T23:26:39Z

What is driving this PR? Did you see any issues in the final generated code that motivates this change?

F-Stuckmann · 2026-05-07T16:16:50Z

The motivating example was Slice.

We had multiple branches inside of a single MBB loop, which disabled our regular loop handling and expanded the II from 2 to ~16

F-Stuckmann added 9 commits May 4, 2026 08:26

martien-de-jong reviewed May 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AIEX] Refactor shuffle clang headers to remove branches.#982

[AIEX] Refactor shuffle clang headers to remove branches.#982
F-Stuckmann wants to merge 9 commits into
aie-publicfrom
stuckmann.slice

F-Stuckmann commented May 4, 2026

Uh oh!

martien-de-jong May 5, 2026 •

edited

Loading

Uh oh!

martien-de-jong May 5, 2026 •

edited

Loading

Uh oh!

martien-de-jong May 5, 2026

Uh oh!

martien-de-jong left a comment

Uh oh!

konstantinschwarz commented May 6, 2026

Uh oh!

F-Stuckmann commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

F-Stuckmann commented May 4, 2026

Uh oh!

martien-de-jong May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martien-de-jong May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martien-de-jong May 5, 2026

Choose a reason for hiding this comment

Uh oh!

martien-de-jong left a comment

Choose a reason for hiding this comment

Uh oh!

konstantinschwarz commented May 6, 2026

Uh oh!

F-Stuckmann commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

martien-de-jong May 5, 2026 •

edited

Loading

martien-de-jong May 5, 2026 •

edited

Loading

F-Stuckmann commented May 7, 2026 •

edited

Loading