[AMDGPU] Don't emit llvm.amdgcn.permlane64 on CDNA #746
Draft
hughperkins wants to merge 5 commits into
v_permlane64_b32 is an RDNA-only instruction: it exists on gfx11 (RDNA3) and gfx12 (RDNA4) but on no CDNA part. quadrants previously enabled the llvm.amdgcn.permlane64 intrinsic on gfx940/gfx941/gfx942 (CDNA3) as well. On gfx942 the AMDGPU backend does not cleanly "Cannot select" the intrinsic -- it selects the V_PERMLANE64_B32 pseudo, which has no valid MC opcode for CDNA, and then crashes with a bare SIGSEGV in SIInstrInfo::getInstSizeInBytes during the branch-relaxation pass. This made scene.build() segfault for any kernel using a cross-half subgroup shuffle (genesis-world #2962).

Gate has_permlane64 to gfx11/gfx12 only, so every CDNA target (and gfx10.x RDNA1/2) takes the existing LDS-roundtrip software emulation, which produces correct cross-half results on wave64 hardware. Also drop the QD_AMDGPU_FORCE_PERMLANE64_FALLBACK env-var escape hatch and correct the related comments in llvm_context.cpp, runtime.cpp and test_simt.py.
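The gating rule described above can be sketched as a small predicate. This is an illustrative Python model only, not the project's code: the function name mirrors the has_permlane64 flag from the commit message, but its signature and the string-prefix parsing of the target name are assumptions made for this sketch.

```python
def has_permlane64(arch: str) -> bool:
    """Return True only for targets that have a native v_permlane64_b32.

    Hypothetical helper: per the commit message, only gfx11 (RDNA3) and
    gfx12 (RDNA4) qualify; every gfx9xx (GCN/CDNA) and gfx10.x (RDNA1/2)
    target must fall back to software emulation.
    """
    # Drop any target-feature suffix such as ":sramecc+:xnack-".
    base = arch.split(":", 1)[0].lower()
    return base.startswith(("gfx11", "gfx12"))


for name in ("gfx942", "gfx90a", "gfx1030", "gfx1100", "gfx1201"):
    print(name, has_permlane64(name))
```

Matching on the family prefix rather than listing individual chips keeps the default conservative: any target that is not explicitly gfx11 or gfx12 takes the fallback path.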
Gating permlane64 off CDNA (prev commit) stops the SIGSEGV but exposed a latent correctness bug: the cross-half shuffle helper is RDNA-shaped. It masks ds_bpermute's target to 31 and relies on permlane64 to fetch the top half, which is correct only where ds_bpermute is SIMD32-scoped (RDNA). On CDNA ds_bpermute is wave64-wide, so masking to 31 means every lane reads a bottom-half lane and the top half is never reached -- shuffle_xor(v,32) returned [32..63, 32..63] instead of [32..63, 0..31] on gfx942 (CI never caught this: RDNA uses the native instruction, not this path).

Make the lowering architecture-aware via two JIT-patched knobs:

- amdgpu_ds_bpermute_lane_mask(): 63 on GCN/CDNA (gfx9xx, wave64-wide ds_bpermute), 31 on RDNA (SIMD32-scoped). With mask 63 a single wide ds_bpermute already returns lane target_lane for the whole wave.
- permlane64 patched to the identity on CDNA, so the helper's cross-SIMD branch equals the same-SIMD branch and the per-lane select is a true no-op (and the intrinsic, which has no MC opcode on CDNA, is never emitted).

RDNA paths are unchanged: gfx11/gfx12 keep native v_permlane64_b32, gfx10.x keeps the LDS-roundtrip emulation, both with lane mask 31.
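To make the bug and the fix concrete, here is a small lane-level simulation in Python. It is a sketch under stated assumptions, not the actual helper: the function names, the lane-index (rather than byte-address) form of ds_bpermute, the "each half reads only its own half" model of the SIMD32-scoped variant, and the select predicate are all simplifications chosen for illustration.

```python
WAVE = 64  # lanes per wave64


def ds_bpermute(src, idx, wave_wide):
    """Model ds_bpermute: every lane reads src at the lane index it supplies.

    Assumed model: wave_wide=True reads across all 64 lanes (CDNA);
    wave_wide=False confines each lane to its own 32-lane half (RDNA).
    """
    if wave_wide:
        return [src[idx[lane] % WAVE] for lane in range(WAVE)]
    return [src[(lane & 32) | (idx[lane] & 31)] for lane in range(WAVE)]


def permlane64(src):
    """Swap the two 32-lane halves of the wave."""
    return [src[lane ^ 32] for lane in range(WAVE)]


def identity(src):
    """Stand-in for permlane64 patched to the identity (CDNA after the fix)."""
    return list(src)


def shuffle(src, target, lane_mask, wave_wide, swap_halves):
    """Assumed helper shape: two permutes plus a per-lane select."""
    idx = [t & lane_mask for t in target]
    same = ds_bpermute(src, idx, wave_wide)                # same-SIMD branch
    cross = ds_bpermute(swap_halves(src), idx, wave_wide)  # cross-SIMD branch
    # Take the cross branch when the target lane sits in the other half.
    return [cross[lane] if (lane ^ target[lane]) & 32 else same[lane]
            for lane in range(WAVE)]


def shuffle_xor(src, mask, **knobs):
    return shuffle(src, [lane ^ mask for lane in range(WAVE)], **knobs)


v = list(range(WAVE))
# RDNA-shaped helper running on wave-wide ds_bpermute (the latent bug).
buggy_cdna = shuffle_xor(v, 32, lane_mask=31, wave_wide=True, swap_halves=permlane64)
# CDNA after the fix: lane mask 63, permlane64 replaced by the identity.
fixed_cdna = shuffle_xor(v, 32, lane_mask=63, wave_wide=True, swap_halves=identity)
# RDNA wave64: lane mask 31, half-scoped ds_bpermute, real half swap.
rdna = shuffle_xor(v, 32, lane_mask=31, wave_wide=False, swap_halves=permlane64)
print(buggy_cdna[:2], buggy_cdna[32:34])
print(fixed_cdna[:2], fixed_cdna[32:34])
```

Under this model the RDNA-shaped configuration on wave-wide hardware reproduces the reported [32..63, 32..63] result, while the mask-63 and identity combination makes both branches equal, so the select no longer matters.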
The subgroup docs described emitting v_permlane64_b32 on CDNA as "well-defined and free", which is exactly the pre-fix behavior that crashed the AMDGPU backend (genesis-world #2962). Document the actual per-arch lowering: a single wave-wide ds_bpermute on CDNA (no permlane64) and the permlane64 + ds_bpermute + select pairing on RDNA wave64.
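The per-arch lowering that the commits describe can be summarized as a lookup. The following Python sketch is illustrative only: ShuffleLowering and shuffle_lowering are hypothetical names, and the string labels for the permlane64 strategy are invented for this example; only the lane-mask values and the arch-to-strategy mapping come from the commit messages.

```python
from typing import NamedTuple


class ShuffleLowering(NamedTuple):
    lane_mask: int   # value of the ds_bpermute lane-mask knob
    permlane64: str  # "native", "lds_emulation", or "identity" (labels are assumptions)


def shuffle_lowering(arch: str) -> ShuffleLowering:
    """Map a gfx target name to the cross-half shuffle lowering it uses."""
    base = arch.split(":", 1)[0].lower()
    if base.startswith(("gfx11", "gfx12")):
        # RDNA3/RDNA4: native v_permlane64_b32 + ds_bpermute + select.
        return ShuffleLowering(31, "native")
    if base.startswith("gfx10"):
        # RDNA1/2: same pairing, with permlane64 emulated via an LDS roundtrip.
        return ShuffleLowering(31, "lds_emulation")
    if base.startswith("gfx9"):
        # GCN/CDNA: one wave-wide ds_bpermute; permlane64 is never emitted.
        return ShuffleLowering(63, "identity")
    raise ValueError(f"unrecognized AMDGPU target: {arch!r}")


for name in ("gfx942", "gfx1030", "gfx1100"):
    print(name, shuffle_lowering(name))
```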