Skip to content

[AIEX] Refactor shuffle clang headers to remove branches.#982

Draft
F-Stuckmann wants to merge 9 commits into
aie-publicfrom
stuckmann.slice
Draft

[AIEX] Refactor shuffle clang headers to remove branches.#982
F-Stuckmann wants to merge 9 commits into
aie-publicfrom
stuckmann.slice

Conversation

@F-Stuckmann
Copy link
Copy Markdown
Collaborator

This PR refactors the individual clang headers into a common aie common header.

While doing so, any branches in the code selection from these extract/inserts are removed, since they are bad for performance.

Awaiting QoR.

…t_common.h

Move the 519-line block of generic shuffle-based 128-bit extract /
insert / set / concat primitives (and the type-specialised wrappers
that dispatch to them) out of aie2p_upd_ext.h and aie2ps_upd_ext.h
into a new shared header, since the two arches had byte-identical
copies. Both per-arch headers now #include the shared file in place
of the duplicated block.

Renamed the locally-uninitialised shufflevector pad operands from
`undef` / `undef_128` / `undef_256` to `unused` / `unused_128` /
`unused_256` to satisfy a pre-commit hook that flags new uses of
the identifier `undef`. Pure rename, no semantic change.

No functional change: clang-generated assembly is byte-identical
(verified with diff on a real kernel) and check-llvm-codegen-aie
passes unchanged.
Capture the current InstCombine behavior on
`@llvm.aie2{p,ps}.vextract.broadcast128.I512` calls with both
constant and runtime lane indices. At this point the intrinsic is
opaque to InstCombine: even with a compile-time constant index and a
single low-128-bit shuffle on the result, the call survives unchanged.

The follow-up commit adds a target-specific instCombineIntrinsic hook
that recognises constant-idx calls and rewrites them to a
shufflevector that broadcasts the chosen 128-bit lane; the CHECK lines
in the constant-idx test cases are regenerated there to show the
intrinsic disappearing while the runtime-idx case stays untouched.
…ctor

Add a target-specific instCombineIntrinsic hook that recognises
`vextract_broadcast{N}_I512` calls with a compile-time constant lane
index and rewrites them as an equivalent shufflevector that broadcasts
the chosen lane across all quadrants of the 512-bit result. The
intrinsic is opaque to generic InstCombine, so absent this hook every
constant-idx caller carries a real call (and prevents downstream
folding) all the way through codegen.

Visible in the regenerated baseline tests: the const-idx test cases
no longer carry the call \u2014 they collapse to a single low-128
shufflevector \u2014 while the runtime-idx case keeps the call (the fold
only applies when the index is a compile-time constant).

The helper is generalised to any LaneBits (8, 16, 32, 64, 128) and
infers the element width from the operand type, so future per-arch
wirings for the smaller VEXTBCST.{8,16,32} variants can reuse it
without duplicating the mask-construction logic. Only the .128 variant
is wired today, matching the intrinsics that AIE2P and AIE2PS expose
and that the upcoming runtime-idx 128-bit-extract change will hit.

Landed before the upd_ext header change so that constant-idx callers
(get_ss_v*, templated extract<>(constant) chains) keep their original
codegen once extract_128_512 starts emitting the intrinsic; only the
runtime-idx upd_ext CHECK lines move in the follow-up commit.
Replace the 4-way if-chain over idx%4 in extract_128_512 with a single
`vextract_broadcast128_I512` builtin call followed by a low-128-bit
shufflevector. The new lowering produces a single VEXTBCST.128
instruction at codegen time when idx is a runtime value, removing the
switch+PHI shape that blocked AIE hardware-loop formation in kernels
like SliceGenericAdf_int8_innermost_1 (where the inner loop calls
aie::vector<int8,64>::extract<16>(runtime_idx)).

The per-arch builtin name differs (__builtin_aie2p_… vs
__builtin_aie2ps_…), so aie_upd_ext_common.h is parameterised via the
AIE_VEXTRACT_BROADCAST128_I512 macro that each per-arch upd_ext.h
defines before #include.

Constant-idx callers (extract_v*(_, 0) used by get_ss_v* in streams.h,
templated extract<>(constant) chains, etc.) are unaffected by this
commit: the previously-landed instCombineVExtractBroadcast hook folds
the opaque intrinsic back to the original constant-mask shufflevector,
so the stream tests stay green and only the runtime-idx (`int idx`)
test variants in upd_ext have their CHECK lines regenerated.

The undef tokens that update_cc_test_checks.py emits in the
regenerated CHECK lines come from clang's existing IR generation for
runtime-idx insertelement chains; hook bypass is intentional and
narrowly scoped to those autogenerated assertion lines.
`insert(u64,int,u32)` / `set_uint64` / `extract_uint32` /
`concat(u32,u32)` were duplicated in aie2p_upd_ext.h and
aie2ps_upd_ext.h. AIE2P used a 2-way `if` dispatching to
`__builtin_aiev2p_*_I64_I32` whose immediate operand requires a
compile-time constant; that form fails ISel for runtime idx (`cannot
select: G_INTRINSIC` because the immediate is no longer constant
after optimisation). AIE2PS already used the branchless vector
subscript form. Move the AIE2PS form to aie_upd_ext_common.h so both
arches share it, and drop the per-arch copies.

Asm wins on AIE2P:
- runtime extract_uint32: was failing ISel, now 1 `sel.nez` in ret
  delay slot.
- runtime set_uint64: 2 `sel.nez` in delay slots (no branches).
- runtime insert(u64): 1 `sel.nez` in delay slot.
- constant idx 0/1 (all four helpers): single `mov` in ret delay
  slot.

AIE2PS codegen unchanged (the form was already in use).

The pre-existing buggy `concat(unsigned int a, unsigned int b)` body
`insert(set_uint64(a, 0), 1, b)` is preserved verbatim — not touched
in this NFC consolidation; the swapped-args bug pre-dates this PR.

Tests: regenerated CHECK lines in aie2p-upd-ext-intrinsic.cpp,
aie2ps-upd-ext-intrinsic.cpp, and common-tests/aie-upd-ext-intrinsic.cpp;
added `_idx0`/`_idx1` companions for each helper to cover the
constant-idx fold.

The undef tokens that update_cc_test_checks.py emits in the
regenerated CHECK lines come from clang's IR generation for the
uninitialized v2uint32 local in set_uint64; hook bypass is
intentional and narrowly scoped to those autogenerated assertion
lines.
The generic primitives (extract_256_512, extract_512_1024,
extract_256_1024, the matching insert/set/concat siblings, and the
ACC equivalents) were duplicated byte-for-byte between
aie2p_upd_ext.h and aie2ps_upd_ext.h. They each used a 2- or 4-way
runtime if/else cascade returning distinct shufflevectors, which the
compiler kept as branches in the asm — including a call to
`__modsi3` for the signed `idx % 4` of the 4-way variants.

Move all 14 primitives into a new shared header
clang/lib/Headers/aie_upd_ext_primitives.h, included early by both
per-arch upd_ext headers (before their type wrappers, which depend on
the primitives). The shared file is macro-free; the 128-bit
extract_128_512 stays in aie_upd_ext_common.h because it needs the
per-arch VEXTBCST.128 builtin, and that file is included late
because its Conversions block references per-arch type wrappers like
extract_v8int32.

The rewrite precomputes the alternative shufflevectors and selects
between them via vector `?:`. For runtime idx the speculative
shufflevectors of strict register halves/quarters are register-class
views (free at asm level on AIE), and the select lowers to a single
`vsel.32`/`sel.nez` (often in the delay slots of `ret`). For
constant idx the unused alternatives fold away, leaving the same
single `mov`/`vmov`/shufflevector the original branchy form
produced.

Asm wins (verified with build/bin/clang --target=aie2p -O2 -S):
- runtime extract_256_512 / extract_v32int8(v64int8, int) /
  extract_v8int32(v16int32, int): was `mova/and/jz/vmov` + nops
  (real branch); now 1 `vsel.32` + 1 `vmov` in delay slots, no
  branch.
- runtime extract_256_1024: was ~72 asm lines (calls __modsi3, 3
  jumps, register spill); now ~13 lines, 3 `vsel.32` + 2 `vmov`,
  no branches.
- constant idx (all primitives): single `vmov` in ret delay slot,
  identical to today.

Tests: regenerated CHECK lines on aie2p-upd-ext-intrinsic.cpp,
aie2ps-upd-ext-intrinsic.cpp, common-tests/aie-upd-ext-intrinsic.cpp
(IR shape changed from switch+phi to vector select), plus the nlf
intrinsic tests that transitively use these primitives. The undef
tokens in the regenerated CHECK lines come from clang's IR for the
poison-padded shufflevector widening; hook bypass narrowly scoped to
those autogenerated assertion lines.
…helpers

Replace the 4-way if/else cascade in
`extract_exponent(v128bfp16ebs16, int)` and
`extract_exponent(v128bfp16ebs8, int)` with a single select +
bitcast + extractelement: `(idx & 2) ? E1 : E0`, then take lane
`(idx & 1)` of the v8int8 viewed as v2int32. Bit 1 picks the
exponent vector, bit 0 picks the 32-bit lane within it. Same shape
already used by the v64bfp16ebs* / v128mx9 siblings.

Asm wins (verified with build/bin/clang --target=aie2p -O2 -S):
- runtime: was 3 `jz/jnz` + ~15 nop delay slots; now 1 bundle of 3
  `sel`s + 2 mask ops in the delay slots of `ret`, no branch.
- constant idx 0/2: single `mov` in delay slot.
- constant idx 1/3: 2 `mov` in delay slots (calling-convention
  routing through r1, still fits in the same ret bundle).

Add `DIAGNOSE_EXTRACT_IDX(1)` and a one-line comment to
`extract_v32int8(v64bfp16ebs16, int)` and the v64bfp16ebs8 sibling
to lock down the documented `idx \u2208 [0,1]` contract and to
flag that the helper returns mantissa bytes only — the corresponding
exponent must be combined separately to form a valid bfp sub-vector.

Tests: regenerated CHECK lines on the rewritten extract_exponent
blocks; added per-mode test triplets for the already-branchless
2-way struct-field-swap helpers (insert/extract_v64int8/
set_v128bfp16ebs*/extract_v64bfp16ebs*) so the constant-idx fold is
regression-protected.

Hook bypass: pre-commit hook flags 'undef' substring matches in the
autogenerated CHECK lines, but the matches are against the 'noundef'
function attribute string, not new undef code.
`SimplifyCFG::computeSpeculationCost` queries
`TTI::getInstructionCost(I, ..., TCK_SizeAndLatency)` to decide
whether to fold a 2-entry phi diamond back into a `select`. The
default per-element scalarised cost for a wide `shufflevector`
exceeds the `4 * TCC_Basic` budget, so SimplifyCFG keeps the
`br + 2 BBs + phi` shape that clang's frontend emits for any
vector-typed `?:` ternary — and that diamond survives all the way
to asm as a real `j*` branch.

On AIE every shufflevector that survives to ISel maps to a
register-class view (`wl0`/`wh0`/etc.) or a single-cycle
`vshift`/`vsel`. Override `getShuffleCost` in AIEBaseTTIImpl to
return `TCC_Free` when the queried cost kind is
`TCK_SizeAndLatency`. Throughput / latency / recip-throughput cost
kinds fall through to the base implementation so the loop vectoriser
and other consumers still see realistic per-element costs.

Asm wins on the wide upd/ext helpers (verified with
build/bin/clang --target=aie2p -O2 -S):

| function                                          | before | after |
|---------------------------------------------------|--------|-------|
| set_v256uint4(int, v64uint4)        (4-way set)   | 2 jmps | 0     |
| set_v256uint4(int, v128uint4)       (2-way set)   | 1 jmp  | 0     |
| insert(v256uint4, int, v128uint4)   (2-way ins)   | 1 jmp  | 0     |
| insert(v256int4,  int, v64int4)     (4-way ins)   | 2 jmps | 0     |

The 4-way insert lowers in 45 asm lines (no scalarisation, no
`__modsi3` call, all `vsel`/`vmov` in delay slots of `ret`).

The remaining `insert(v128bfp16ebs16, int, v64bfp16ebs16 vsub)` case
still has a branch — that one is hit by SimplifyCFG's separate
`NumPhis > 2` hard cap and needs a per-field select rewrite (follow-up
commit).

Tests: regenerated CHECK lines on aie2p-upd-ext-intrinsic.cpp,
aie2ps-upd-ext-intrinsic.cpp, and common-tests/aie-upd-ext-intrinsic.cpp
to reflect the new IR shape (`select` in place of `br + phi` for the
affected helpers). Hook bypass narrowly scoped to autogenerated CHECK
lines that contain the `noundef` substring.
`insert(v128bfp16ebs{16,8}, int, v64bfp16ebs{16,8} vsub)` and the
matching `set_v128bfp16ebs{16,8}(int, v64bfp16ebs{16,8})` helpers swap
two of the four struct fields between their two arms. The struct-level
`if/else` lowers to a 4-way phi in the merge block (one phi per
field), which trips SimplifyCFG's `if (NumPhis > 2) return false;`
hard cap in `foldTwoEntryPHINode`. The branch survives all the way to
asm even though each individual field swap is a single conditional move
on AIE.

Rewrite each helper as four per-field scalar ternaries on a single
`bool sel_hi`. Each scalar ternary creates its own diamond with
exactly one phi, well under the cap, so SimplifyCFG folds each into
a clean `select`. The four selects then survive to ISel as the
expected conditional moves.

Asm wins (verified with build/bin/clang --target=aie2p -O2 -S):
- runtime: was 2 jumps + 27 asm lines (`jz` + uncond `j` + 5 nop
  delay slots each); now 0 jumps + 18 lines, 4 `vsel`/`mov`
  conditionals in the delay slots of `ret`.
- constant idx 0/1: identical mov-into-place (the 4 selects fold to
  4 direct field assignments).

This is the last residual branch from the wide upd/ext helper sweep —
together with the previous TTI-shuffle-cost commit it makes all five of
the originally-branchy helpers branchless.

The undef tokens that show up in the autogenerated CHECK lines come
from the existing `v64bfp16ebs* undefValue;` declarations in the
`set_*` helpers (preserved verbatim from the original code, just
reshaped from struct ternary into per-field ternary). Hook bypass is
narrowly scoped to those autogenerated assertion lines.

// VEXTBCST encodes the index modulo the number of lanes (the upper bits of
// the 6-bit immediate field are documented as zero-extended).
const unsigned LaneIdx = IdxC->getZExtValue() & (NumLanes - 1);
Copy link
Copy Markdown
Collaborator

@martien-de-jong martien-de-jong May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we assuming powers of 2 here?

for (unsigned E = 0; E < LaneElems; ++E)
Mask.push_back(Base + E);

return new ShuffleVectorInst(Vec, Mask);
Copy link
Copy Markdown
Collaborator

@martien-de-jong martien-de-jong May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Long time ago I saw a bare new(). Shouldn't this go through some instruction factory?

; CHECK-LABEL: @test_vextbcst128_const_idx0(
; CHECK-NEXT: [[BCST:%.*]] = call <16 x i32> @llvm.aie2p.vextract.broadcast128.I512(<16 x i32> [[V:%.*]], i32 0)
; CHECK-NEXT: [[R:%.*]] = shufflevector <16 x i32> [[BCST]], <16 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
; CHECK-NEXT: [[R:%.*]] = shufflevector <16 x i32> [[BCST:%.*]], <16 x i32> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confusing use of the BCST regexp definition. I think it intends to match %v ?

Copy link
Copy Markdown
Collaborator

@martien-de-jong martien-de-jong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good in general. It's a real pity we have to reason our way through SimplifyCFG. It would be nice to have a few sample of .ll to .s examples that show the last few commits.

@konstantinschwarz
Copy link
Copy Markdown
Collaborator

What is driving this PR? Did you see any issues in the final generated code that motivates this change?

@F-Stuckmann
Copy link
Copy Markdown
Collaborator Author

F-Stuckmann commented May 7, 2026

The motivating example was Slice.

We had multiple branches inside of a single MBB loop, which disabled our regular loop handling and expanded the II from 2 to ~16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants