17 changes: 9 additions & 8 deletions docs/source/user_guide/subgroup.md
@@ -110,14 +110,14 @@ Each lane returns the `value` held by the lane whose subgroup-local id equals `i
Lane `i` returns the `value` held by lane `i + offset`. Lanes near the top of the subgroup - where `i + offset >= subgroup_size` - receive an implementation-defined value (typically their own `value`), so reduction patterns must only trust lane 0's final result, or mask out the out-of-range lanes.

- `value` and `offset` dtypes: same as `shuffle` above; `offset` is a `u32`.
- Maps to `__shfl_down_sync` on CUDA and `OpGroupNonUniformShuffleDown` on SPIR-V. On AMDGPU it is emulated with `ds_bpermute`; wave64 cross-half offsets (any `offset >= 32` for low-half lanes, or any non-zero `offset` for high-half lanes that lands across the SIMD32 boundary) go through the same `permlane64 + ds_bpermute + select` lowering as `shuffle` - see [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering). These operations are added on both RDNA and CDNA.
- Maps to `__shfl_down_sync` on CUDA and `OpGroupNonUniformShuffleDown` on SPIR-V. On AMDGPU it is emulated with `ds_bpermute`; wave64 cross-half offsets (any `offset >= 32` for low-half lanes, or any non-zero `offset` for high-half lanes that lands across the SIMD32 boundary) go through the same wave64 cross-half lowering as `shuffle` - a single wave-wide `ds_bpermute` on CDNA, or a `permlane64 + ds_bpermute + select` sequence on RDNA - see [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering). These operations are added on both RDNA and CDNA.
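
The "only trust lane 0" rule above can be illustrated with a small host-side simulation. This is a minimal sketch in plain Python, not Quadrants code: the helper names `shuffle_down` and `reduce_add_lane0` are hypothetical, the subgroup is modelled as a list with one entry per lane, and out-of-range lanes are assumed to keep their own `value` (the text calls that result implementation-defined, so other hardware may differ).

```python
def shuffle_down(values, offset):
    """Simulate shuffle_down over one subgroup (a list, one entry per lane).

    ASSUMPTION: out-of-range lanes keep their own value. The real result
    for those lanes is implementation-defined.
    """
    n = len(values)
    return [values[i + offset] if i + offset < n else values[i] for i in range(n)]


def reduce_add_lane0(values):
    """Tree reduction that halves the offset each step.

    Only lane 0 of the returned list holds the full sum; the upper lanes
    accumulate out-of-range reads and must be ignored or masked.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "subgroup size must be a power of two"
    acc = list(values)
    offset = n // 2
    while offset > 0:
        shifted = shuffle_down(acc, offset)
        acc = [a + b for a, b in zip(acc, shifted)]
        offset //= 2
    return acc


lanes = list(range(32))          # lane i holds the value i
result = reduce_add_lane0(lanes)
print(result[0])                 # 496, the sum of 0..31
print(result[31])                # 992, not the sum: lane 31 only ever read itself
```

Under the keep-own-value assumption the top lane just doubles its value at every step, which is why the pattern reads the answer from lane 0 only.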

### `shuffle_up(value, offset)`

Lane `i` returns the `value` held by lane `i - offset`. Lanes near the bottom of the subgroup - where `i - offset < 0` - receive an implementation-defined value (typically their own `value`), so the bottom `offset` lanes' results should be ignored or masked.

- Same dtype rules as `shuffle` / `shuffle_down`; `offset` is a `u32`.
- Maps to `__shfl_up_sync` on CUDA and `OpGroupNonUniformShuffleUp` on SPIR-V. On AMDGPU it is emulated with `ds_bpermute((lane - offset) * 4, value)`; wave64 cross-half cases go through the [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering) (same `permlane64 + ds_bpermute + select` sequence as `shuffle` / `shuffle_down`). These operations are added on both RDNA and CDNA.
- Maps to `__shfl_up_sync` on CUDA and `OpGroupNonUniformShuffleUp` on SPIR-V. On AMDGPU it is emulated with `ds_bpermute((lane - offset) * 4, value)`; wave64 cross-half cases go through the [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering) (same per-arch handling as `shuffle` / `shuffle_down`: a single wave-wide `ds_bpermute` on CDNA, a `permlane64 + ds_bpermute + select` sequence on RDNA). These operations are added on both RDNA and CDNA.

### `shuffle_xor(value, mask)`

@@ -132,7 +132,7 @@ Lane `i` returns the `value` held by lane `i ^ mask`. Convenient for butterfly p
Every lane in the subgroup returns the `value` held by the lane whose subgroup-local id equals `index`. Expresses intent ("read lane `index`") more directly than `shuffle(value, index)` and on backends with a dedicated broadcast may map to a cheaper instruction.

- Same dtype rules as `shuffle`.
- Maps to `__shfl_sync` on CUDA, `ds_bpermute` (plus a `permlane64`-driven cross-half select on wave64) on AMDGPU, and `OpGroupNonUniformBroadcast` on SPIR-V. See [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering) for the wave64 mechanics. These operations are added on both RDNA and CDNA.
- Maps to `__shfl_sync` on CUDA, `ds_bpermute` (with a `permlane64`-driven cross-half select on RDNA wave64, or a single wave-wide `ds_bpermute` on CDNA wave64) on AMDGPU, and `OpGroupNonUniformBroadcast` on SPIR-V. See [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering) for the wave64 mechanics. These operations are added on both RDNA and CDNA.
- **Important: on SPIR-V, `index` must be dynamically uniform** - the same value on every lane in the subgroup. Passing a per-lane varying `index` is undefined behavior, because `OpGroupNonUniformBroadcast` requires its `Id` operand to be dynamically uniform across the subgroup. On CUDA / AMDGPU, `index` may vary per lane and the call is identical to `shuffle(value, index)`. If you need a varying source lane, use `shuffle` directly.

### `broadcast_first(value)`
@@ -565,9 +565,10 @@ After the call, lane `k` (within each group of 32) holds `a[group_start] + a[gro

### AMDGPU wave64 cross-half lowering

AMDGPU `ds_bpermute_b32` - the LDS-routed permute that Quadrants uses to lower `shuffle`, `shuffle_down`, and `shuffle_up` - has a hardware quirk on RDNA (gfx10/11/12, e.g. RX 7900 XTX): its lane-id operand is **SIMD32-scoped**. On a wave64 RDNA wave the 64 lanes execute as two SIMD32 clusters; `ds_bpermute` on those chips can only address lanes inside the requesting lane's own SIMD32 half. CDNA (gfx9xx, MI200/MI300) keeps the wave on a single SIMD64, so `ds_bpermute` there is wave-wide and the quirk does not exist.
AMDGPU `ds_bpermute_b32` - the LDS-routed permute that Quadrants uses to lower `shuffle`, `shuffle_down`, and `shuffle_up` - reaches a different set of lanes depending on the architecture, so Quadrants lowers cross-half wave64 shuffles two different ways:

To make wave64 `shuffle` / `shuffle_down` / `shuffle_up` behave consistently across RDNA and CDNA, Quadrants always lowers cross-half-capable shuffles through this 3-op sequence:
- **CDNA (gfx9xx, MI200 / MI300)**: the wave runs as a single SIMD64, so `ds_bpermute_b32` is wave-wide and addresses all 64 lanes directly. A cross-half shuffle is therefore a single `ds_bpermute_b32 (target_lane * 4), value` with the lane id masked to 6 bits - no `permlane64` involved. Quadrants must **not** emit `v_permlane64_b32` here: the instruction does not exist on any CDNA part, and feeding `llvm.amdgcn.permlane64` to the backend on gfx9xx makes it select a pseudo with no valid CDNA machine opcode and then crash during branch relaxation (genesis-world issue #2962).
- **RDNA (gfx10/11/12, e.g. RX 7900 XTX)**: the 64 lanes execute as two SIMD32 clusters and `ds_bpermute_b32`'s lane-id operand is **SIMD32-scoped** - it can only address lanes inside the requesting lane's own 32-lane half. To reach the other half, Quadrants pairs `ds_bpermute` with a half-swap through this 3-op sequence:

```
swapped = v_permlane64_b32 value # swap the two SIMD32 halves of the wave
@@ -576,14 +577,14 @@ hi = ds_bpermute_b32 (lane*4), swapped
result = ((target_lane ^ self_lane) & 32) ? hi : lo
```

The two `ds_bpermute_b32` reads run in parallel - one reads the original payload (correct when target is in the same SIMD32 half), the other reads the `permlane64`-swapped payload (correct when the target is in the other half) - and a per-lane select picks between them based on whether the target crosses the 32-lane boundary. On CDNA the cross-half branch is dead, but the cost is one extra `v_permlane64_b32` (still well-defined and free) and one `v_cndmask_b32` - no measurable hit. On RDNA wave64 this is the only correct lowering.
The two `ds_bpermute_b32` reads run in parallel - one reads the original payload (correct when the target is in the same SIMD32 half), the other reads the `permlane64`-swapped payload (correct when the target is in the other half) - and a per-lane select picks between them based on whether the target crosses the 32-lane boundary. The half-swap is a single `v_permlane64_b32` on gfx11 / gfx12 (RDNA3 / RDNA4); on gfx10.x (RDNA1/2), which predates the instruction, it is emulated with a wave-local LDS round-trip. (The same per-lane select runs on CDNA too, but with `permlane64` reduced to a no-op the two reads collapse to one wave-wide `ds_bpermute` and the select becomes dead.)
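
To see why the RDNA sequence reaches every lane, here is a minimal host-side model of it in plain Python. It is a sketch under stated assumptions, not the runtime helper: `ds_bpermute_simd32`, `permlane64`, and `cross_half_shuffle` are hypothetical names, lanes are list indices, and the SIMD32 scoping is modelled by honouring only the low 5 bits of the requested lane id inside the caller's own half.

```python
WAVE = 64
HALF = 32


def ds_bpermute_simd32(target, payload):
    """SIMD32-scoped permute: lane i reads from its own 32-lane half only.

    Only the low 5 bits of target[i] are used (lane mask 31); the half is
    fixed by bit 5 of the requesting lane i.
    """
    return [payload[(i & HALF) | (target[i] & (HALF - 1))] for i in range(WAVE)]


def permlane64(payload):
    """Swap the two halves of the wave: lane i receives lane i ^ 32's value."""
    return [payload[i ^ HALF] for i in range(WAVE)]


def cross_half_shuffle(value, target):
    """Model of the 3-op sequence: half-swap, two permutes, per-lane select."""
    swapped = permlane64(value)
    lo = ds_bpermute_simd32(target, value)      # right when target is in my half
    hi = ds_bpermute_simd32(target, swapped)    # right when target is in the other half
    return [hi[i] if (target[i] ^ i) & HALF else lo[i] for i in range(WAVE)]


value = [100 + i for i in range(WAVE)]
target = [(7 * i + 3) % WAVE for i in range(WAVE)]   # arbitrary per-lane targets
result = cross_half_shuffle(value, target)
print(result == [value[t] for t in target])          # True
```

In this model, replacing `permlane64` with the identity and widening the mask to 63 makes `lo` and `hi` equal, which matches the parenthetical above about the select becoming dead on CDNA.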

One subtlety worth knowing about (mostly for anyone reading the generated IR): the lane-id operand to `ds_bpermute` is wrapped in an empty `+v` inline-asm fence inside the runtime helper. Without that fence, LLVM's AMDGPU backend can decide a compile-time-constant or otherwise uniform lane-id is "uniform across the wave" and silently lower the call to a `v_readlane_b32`-style instruction that addresses lanes 0..31 **wave-globally** rather than SIMD32-locally. That would break cross-half shuffles whose target lane is a literal (`broadcast(v, 47)`, `shuffle(v, qd.u32(40))`, etc.). The fence costs zero - same instruction shape on every path - and pins the lowering to a real `ds_bpermute_b32` so the SIMD-local semantics our `permlane64` pairing relies on always hold.
One subtlety worth knowing about (mostly for anyone reading the generated IR): the lane-id operand to `ds_bpermute` is wrapped in an empty `+v` inline-asm fence inside the runtime helper. Without that fence, LLVM's AMDGPU backend can decide a compile-time-constant or otherwise uniform lane-id is "uniform across the wave" and silently lower the call to a `v_readlane_b32`-style instruction that addresses lanes 0..31 **wave-globally** rather than per the real `ds_bpermute_b32` lane semantics. That would break cross-half shuffles whose target lane is a literal (`broadcast(v, 47)`, `shuffle(v, qd.u32(40))`, etc.) on both ISAs. The fence costs zero - same instruction shape on every path - and pins the lowering to a real `ds_bpermute_b32` so the lane addressing the shuffle lowering relies on always holds.

## Performance notes

- Shuffles are register-to-register on CUDA (`__shfl_sync`, `__shfl_down_sync`, `__shfl_up_sync`) and on SPIR-V where the GPU has hardware support - typically a handful of cycles, no memory traffic.
- AMDGPU `shuffle`, `shuffle_down`, and `shuffle_up` all go through `ds_permute` / `ds_bpermute` (LDS-routed, roughly tens of cycles). On wave64 the lowering issues two parallel `ds_bpermute_b32` reads plus a `v_permlane64_b32` swap and a per-lane select to handle cross-half shuffles correctly on RDNA - see [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering). The two `ds_bpermute` reads issue in parallel, so the latency is the same as a single read; the `permlane64` and `cndmask` add a few extra cycles.
- AMDGPU `shuffle`, `shuffle_down`, and `shuffle_up` all go through `ds_permute` / `ds_bpermute` (LDS-routed, roughly tens of cycles). On CDNA wave64 a cross-half shuffle is a single wave-wide `ds_bpermute_b32`; on RDNA wave64 the lowering issues two parallel `ds_bpermute_b32` reads plus a `v_permlane64_b32` swap and a per-lane select to reach across the SIMD32 boundary - see [AMDGPU wave64 cross-half lowering](#amdgpu-wave64-cross-half-lowering). The two `ds_bpermute` reads issue in parallel, so the latency is the same as a single read; on RDNA the `permlane64` and `cndmask` add a few extra cycles.
- `shuffle_xor` and `broadcast_first` are `@qd.func` wrappers over `shuffle` / `broadcast` and inline at compile time, so on every backend they cost exactly the same as the underlying op.
- Both `ballot_first_n` and `ballot` lower to a single hardware instruction on every backend - one cycle on CUDA (`__ballot_sync`), one instruction on AMDGPU (a single `v_cmp_*_e64` populating the wavefront-width SETCC, then a low-half store for `ballot_first_n`), and `OpGroupNonUniformBallot` on SPIR-V (extract one or two components of the result `uvec4`). At `n == 32` `ballot_first_n` elides the predicate-masking step entirely; at `n < 32` it inserts one extra multiply on the predicate.
- `reduce_add` and `reduce_all_add` both issue exactly `log2_group_size()` shuffles and `log2_group_size()` adds per call (5 on wave32, 6 on AMDGPU wave64). No barriers, no shared memory, no launch overhead (they inline). The same holds for the `_tiled` form at any `log2_size`.
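
The shuffle count in the last bullet can be checked with a small simulation. This is a sketch in plain Python that assumes a `shuffle_xor` butterfly for the all-lanes reduction; whether `reduce_all_add` is implemented exactly this way is an assumption, and the helper names below are hypothetical stand-ins for the real ops.

```python
def shuffle_xor(values, mask):
    """Lane i reads the value held by lane i ^ mask."""
    return [values[i ^ mask] for i in range(len(values))]


def reduce_all_add_sim(values):
    """Butterfly reduction: every lane ends with the full sum.

    Returns the per-lane results and the number of shuffle steps issued.
    """
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "subgroup size must be a power of two"
    acc = list(values)
    steps = 0
    mask = 1
    while mask < n:
        partner = shuffle_xor(acc, mask)
        acc = [a + b for a, b in zip(acc, partner)]
        mask <<= 1
        steps += 1
    return acc, steps


sums32, steps32 = reduce_all_add_sim(list(range(32)))
sums64, steps64 = reduce_all_add_sim(list(range(64)))
print(steps32, steps64)          # 5 6
print(sums32[0], sums64[63])     # 496 2016
```

Each step pairs one shuffle with one add, so the step count equals log2 of the group size: 5 for a 32-lane group and 6 for a 64-lane group.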
85 changes: 54 additions & 31 deletions quadrants/runtime/llvm/llvm_context.cpp
@@ -552,43 +552,66 @@ std::unique_ptr<llvm::Module> QuadrantsLLVMContext::module_from_file(const std::
}
patch_intrinsic("amdgpu_clock_i64", llvm::Intrinsic::amdgcn_s_memtime);
patch_intrinsic("amdgpu_ds_bpermute", llvm::Intrinsic::amdgcn_ds_bpermute);
// ``llvm.amdgcn.permlane64`` exchanges a 32-bit value between lanes ``i`` and ``i ^ 32`` in a single instruction.
// We use it to extend the SIMD32-scoped ``ds_bpermute`` (every shuffle op lowers to that) into a wave64-aware
// cross-half shuffle on RDNA: ``ds_bpermute`` reads within the lane's own 32-lane SIMD cluster, ``permlane64``
// brings the other SIMD's value to this lane, and we select between the two based on which half the target lane
// sits in. See ``amdgpu_cross_half_shuffle_i32`` in runtime.cpp. The instruction is gfx940+ (CDNA3) and gfx11+
// (RDNA3+) only -- on earlier wave64-capable targets (gfx9xx CDNA1/2, gfx10.x RDNA1/2) the AMDGPU LLVM backend
// hits "Cannot select" while lowering the intrinsic, so we have to provide a software emulation.
// The wave64 cross-half subgroup shuffle (``amdgpu_cross_half_shuffle_i32`` in runtime.cpp) is built from
// ``ds_bpermute`` plus, on some targets, ``permlane64``. How those behave -- and therefore how we patch them --
// depends on the architecture family:
//
// The emulation is a wave-local LDS roundtrip: each lane writes its ``value`` to ``lds[wave_base + lane]``,
// a wavefront-scope acquire-release fence lowers to ``s_waitcnt lgkmcnt(0)`` (drains outstanding LDS writes),
// and each lane then reads ``lds[wave_base + (lane ^ 32)]``. On RDNA wave64-emulation the two SIMD32 halves of
// the wave issue store / load in two passes apiece, but the waitcnt between them guarantees both halves' stores
// are committed to LDS before either half's loads issue, so the cross-half routing is correct. ``wave_base``
// is ``(workitem.id.x >> 6) << 6``, scoping the LDS slot to a single wave so multi-wave workgroups don't
// collide. The LDS buffer is a 1024-entry per-workgroup global (4 KiB) -- enough for the AMDGPU 1024-thread
// workgroup max at wave64. The buffer is only materialised on this code path, so kernels on permlane64-capable
// hardware (the common case) pay zero LDS for cross-half shuffles.
// * GCN / CDNA (gfx9xx: gfx900/906 Vega, gfx908/gfx90a CDNA1/2, gfx940/gfx941/gfx942 CDNA3, gfx950 CDNA4):
// ``ds_bpermute`` addresses the full wave64 directly, so the cross-half shuffle is a single wide
// ``ds_bpermute`` (lane mask 63) and ``permlane64`` is unnecessary. Critically, ``v_permlane64_b32`` does
// not exist on CDNA -- emitting ``llvm.amdgcn.permlane64`` makes the backend select the ``V_PERMLANE64_B32``
// pseudo, which has no valid MC opcode for CDNA, and then crash with a bare SIGSEGV inside
// ``SIInstrInfo::getInstSizeInBytes`` during branch relaxation (genesis-world #2962). So on CDNA we patch
// ``amdgpu_permlane64`` to the identity, which neutralises the helper's (RDNA-shaped) cross-SIMD branch
// without ever emitting the intrinsic.
// * RDNA3/4 (gfx11 / gfx12): ``ds_bpermute`` is SIMD32-scoped (lane mask 31), so the top half is reached via
// the native single-instruction ``v_permlane64_b32``.
// * RDNA1/2 (gfx10.x): ``ds_bpermute`` is SIMD32-scoped (lane mask 31), but ``v_permlane64_b32`` does not
// exist yet, so we emulate the lane ``i`` <-> ``i ^ 32`` swap with an LDS roundtrip (below).
//
// The intrinsic is overloaded on its element type (signature ``T -> T`` for any 32-bit-or-smaller ``T``), so we
// have to pass the explicit ``i32`` type alongside the ID -- otherwise ``CreateIntrinsic`` segfaults inside
// ``getDeclaration()`` while resolving the mangled name.
// The LDS emulation writes each lane's ``value`` to ``lds[wave_base + lane]``, issues a wavefront-scope
// acquire-release fence (lowers to ``s_waitcnt lgkmcnt(0)``: drains outstanding LDS writes without the
// cross-wave ``s_barrier`` a workgroup-scope fence would emit, which would deadlock if only some waves reach
// this point), then reads back ``lds[wave_base + (lane ^ 32)]``. ``wave_base`` is ``(workitem.id.x >> 6) << 6``,
// scoping the slot to a single wave so multi-wave workgroups don't collide. The buffer is a 1024-entry
// per-workgroup global (4 KiB, the AMDGPU 1024-thread wave64 max), materialised only on this path, so kernels
// on the other two paths pay zero LDS for cross-half shuffles.
//
// ``patch_intrinsic`` for permlane64 passes the explicit ``i32`` type alongside the ID because the intrinsic is
// overloaded on its element type (signature ``T -> T`` for any 32-bit-or-smaller ``T``); otherwise
// ``CreateIntrinsic`` segfaults inside ``getDeclaration()`` while resolving the mangled name.
auto mcpu_str = AMDGPUContext::get_instance().get_mcpu();
bool has_permlane64 = (mcpu_str == "gfx940" || mcpu_str == "gfx941" || mcpu_str == "gfx942" ||
mcpu_str.substr(0, 5) == "gfx11" || mcpu_str.substr(0, 5) == "gfx12");
// Escape hatch for validating the LDS software emulation on hardware that natively supports
// ``v_permlane64_b32``: setting ``QD_AMDGPU_FORCE_PERMLANE64_FALLBACK=1`` forces the JIT to take the LDS path
// even on gfx11+ / gfx940+, so we can exercise the fallback on a working AMD box (gfx1100 / gfx942) without
// needing a gfx10.x runner. Has no effect on non-AMDGPU backends.
if (const char *force_fallback = std::getenv("QD_AMDGPU_FORCE_PERMLANE64_FALLBACK")) {
if (force_fallback[0] == '1') {
has_permlane64 = false;
}
bool is_gcn_cdna = (mcpu_str.substr(0, 4) == "gfx9");
bool has_native_permlane64 = (mcpu_str.substr(0, 5) == "gfx11" || mcpu_str.substr(0, 5) == "gfx12");

// Patch the ds_bpermute lane mask used by ``amdgpu_cross_half_shuffle_i32``: 63 where ``ds_bpermute`` is
// wave64-wide (GCN/CDNA), 31 where it is SIMD32-scoped (RDNA, paired with the permlane64 swap above).
if (auto mask_func = module->getFunction("amdgpu_ds_bpermute_lane_mask")) {
mask_func->deleteBody();
auto bb = llvm::BasicBlock::Create(*ctx, "entry", mask_func);
IRBuilder<> builder(*ctx);
builder.SetInsertPoint(bb);
builder.CreateRet(llvm::ConstantInt::get(llvm::Type::getInt32Ty(*ctx), is_gcn_cdna ? 63 : 31));
QuadrantsLLVMContext::mark_inline(mask_func);
}
if (has_permlane64) {

if (has_native_permlane64) {
patch_intrinsic("amdgpu_permlane64", llvm::Intrinsic::amdgcn_permlane64, true, {llvm::Type::getInt32Ty(*ctx)});
} else if (is_gcn_cdna) {
// CDNA: the wide ds_bpermute already reaches all 64 lanes, so permlane64 is unnecessary -- and emitting it
// crashes the backend. Patch it to the identity so the helper's cross-SIMD branch returns the same value as
// its same-SIMD branch, making the per-lane select a true no-op.
if (auto permlane64_func = module->getFunction("amdgpu_permlane64")) {
permlane64_func->deleteBody();
auto bb = llvm::BasicBlock::Create(*ctx, "entry", permlane64_func);
IRBuilder<> builder(*ctx);
builder.SetInsertPoint(bb);
builder.CreateRet(&*permlane64_func->arg_begin());
QuadrantsLLVMContext::mark_inline(permlane64_func);
}
} else if (auto permlane64_func = module->getFunction("amdgpu_permlane64")) {
// LDS-based software emulation. Layout: ``[1024 x i32] addrspace(3)`` indexed by ``wave_base + lane``.
// gfx10.x RDNA1/2: LDS-based software emulation. Layout: ``[1024 x i32] addrspace(3)`` indexed by ``wave_base +
// lane``.
auto i32_ty = llvm::Type::getInt32Ty(*ctx);
auto buf_ty = llvm::ArrayType::get(i32_ty, 1024);
auto lds_global = llvm::cast_or_null<llvm::GlobalVariable>(module->getNamedValue("__amdgpu_permlane64_lds"));