[AMDGPU] Don't emit llvm.amdgcn.permlane64 on CDNA #746

Draft
hughperkins wants to merge 5 commits into main from hp/amdgpu-no-permlane64-on-cdna

@hughperkins (Collaborator)

v_permlane64_b32 is an RDNA-only instruction: it exists on gfx11 (RDNA3) and
gfx12 (RDNA4) but on no CDNA part. quadrants previously enabled the
llvm.amdgcn.permlane64 intrinsic on gfx940/gfx941/gfx942 (CDNA3) as well. On
gfx942 the AMDGPU backend does not fail cleanly with "Cannot select" -- it
selects the V_PERMLANE64_B32 pseudo, which has no valid MC opcode for CDNA, and
then crashes with a bare SIGSEGV in SIInstrInfo::getInstSizeInBytes during the
branch-relaxation pass. This made scene.build() segfault for any kernel using a
cross-half subgroup shuffle (genesis-world #2962).

Gate has_permlane64 to gfx11/gfx12 only, so every CDNA target (and gfx10.x
RDNA1/2) takes the existing LDS-roundtrip software emulation, which produces
correct cross-half results on wave64 hardware. Also drop the
QD_AMDGPU_FORCE_PERMLANE64_FALLBACK env-var escape hatch and correct the
related comments in llvm_context.cpp, runtime.cpp and test_simt.py.

Issue: #
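
The gating rule described above can be sketched as a small predicate. This is an illustrative Python model, not the code in llvm_context.cpp: the function name `has_permlane64` mirrors the flag named in this PR, but its signature, the string-prefix matching and the handling of feature suffixes are assumptions made for the example.

```python
def has_permlane64(arch: str) -> bool:
    """Hypothetical gate: True only for gfx11 (RDNA3) and gfx12 (RDNA4)."""
    # Assumption: the target may carry a feature suffix such as
    # ":sramecc+:xnack-"; drop it before matching on the processor name.
    name = arch.split(":", 1)[0].lower()
    # Every CDNA part (gfx9xx) and gfx10.x (RDNA1/2) falls through to False
    # and therefore takes the software fallback.
    return name.startswith(("gfx11", "gfx12"))


if __name__ == "__main__":
    for arch in ("gfx1100", "gfx1201", "gfx942", "gfx90a", "gfx1030"):
        print(f"{arch}: has_permlane64={has_permlane64(arch)}")
```

The point of the sketch is that the check is an allow-list of the two RDNA generations that have the instruction, rather than a deny-list of CDNA parts, so an unrecognized target defaults to the fallback path.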

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough


Gating permlane64 off CDNA (previous commit) stops the SIGSEGV but exposed a
latent correctness bug: the cross-half shuffle helper is RDNA-shaped. It masks
ds_bpermute's target to 31 and relies on permlane64 to fetch the top half,
which is correct only where ds_bpermute is SIMD32-scoped (RDNA). On CDNA,
ds_bpermute is wave64-wide, so masking to 31 means every lane reads a
bottom-half lane and the top half is never reached: on gfx942,
shuffle_xor(v, 32) returned [32..63, 32..63] instead of [32..63, 0..31]. CI
never caught this because RDNA uses the native instruction, not this path.
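
The failure mode can be reproduced with a small lane-level model. The Python sketch below is an assumption-laden simulation, not the project's helper: the function names, the `simd32_scoped` flag and the use of plain lane indices (the real ds_bpermute takes byte addresses) are all invented for illustration. It models the RDNA-shaped sequence described above: mask the target to 31, run ds_bpermute on both the original and the half-swapped value, then select per lane.

```python
WAVE = 64


def ds_bpermute(src, targets, simd32_scoped):
    """Lane i reads src[targets[i]].

    simd32_scoped=True models RDNA wave64 (reads stay inside the lane's own
    32-lane half); False models CDNA (reads span the whole wave).
    """
    out = []
    for lane, t in enumerate(targets):
        if simd32_scoped:
            t = (lane & 32) | (t & 31)
        else:
            t = t & 63
        out.append(src[t])
    return out


def permlane64(src):
    """Swap the two 32-lane halves (native instruction or LDS emulation)."""
    return [src[lane ^ 32] for lane in range(WAVE)]


def shuffle_rdna_shaped(v, targets, simd32_scoped):
    """RDNA-shaped helper: target masked to 31, permlane64 for the far half."""
    masked = [t & 31 for t in targets]
    same = ds_bpermute(v, masked, simd32_scoped)
    cross = ds_bpermute(permlane64(v), masked, simd32_scoped)
    # Pick the cross-half result when the target lives in the other half.
    return [cross[l] if (targets[l] ^ l) & 32 else same[l] for l in range(WAVE)]


def shuffle_xor(v, mask, simd32_scoped):
    return shuffle_rdna_shaped(v, [l ^ mask for l in range(WAVE)], simd32_scoped)


if __name__ == "__main__":
    v = list(range(WAVE))
    rdna = shuffle_xor(v, 32, simd32_scoped=True)
    cdna = shuffle_xor(v, 32, simd32_scoped=False)
    print("RDNA model:", rdna[:2], "...", rdna[32:34], "...")
    print("CDNA model:", cdna[:2], "...", cdna[32:34], "...")
```

Under this model the RDNA case yields [32..63, 0..31], while the wave-wide case yields [32..63, 32..63]: the top-half lanes index into the bottom half of the swapped value, which holds their own data.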

Make the lowering architecture-aware via two JIT-patched knobs:
- amdgpu_ds_bpermute_lane_mask(): 63 on GCN/CDNA (gfx9xx, wave64-wide
  ds_bpermute), 31 on RDNA (SIMD32-scoped). With mask 63 a single wide
  ds_bpermute already returns lane target_lane for the whole wave.
- permlane64 patched to the identity on CDNA, so the helper's cross-SIMD
  branch equals the same-SIMD branch and the per-lane select is a true no-op
  (and the intrinsic, which has no MC opcode on CDNA, is never emitted).
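
How the two knobs combine can be checked with a lane-level simulation. The Python sketch below is a model under stated assumptions, not the JIT-patched runtime code: `ds_bpermute_lane_mask` borrows the name of the knob above, but every signature here, the `is_gfx9` prefix test and the use of lane indices instead of byte addresses are hypothetical.

```python
WAVE = 64


def is_gfx9(arch):
    """Assumption: any 'gfx9...' name is a GCN/CDNA wave64 part."""
    return arch.lower().startswith("gfx9")


def ds_bpermute_lane_mask(arch):
    """63 where ds_bpermute is wave64-wide, 31 where it is SIMD32-scoped."""
    return 63 if is_gfx9(arch) else 31


def permlane64(src, arch):
    """Identity on GCN/CDNA; half swap on RDNA (native or LDS emulation)."""
    if is_gfx9(arch):
        return list(src)
    return [src[lane ^ 32] for lane in range(WAVE)]


def ds_bpermute(src, targets, arch):
    """Lane i reads src[targets[i]], scoped according to the architecture."""
    if is_gfx9(arch):
        return [src[t & 63] for t in targets]
    return [src[(lane & 32) | (t & 31)] for lane, t in enumerate(targets)]


def shuffle(v, targets, arch):
    """Same helper shape on every arch; only the two knobs differ."""
    mask = ds_bpermute_lane_mask(arch)
    masked = [t & mask for t in targets]
    same = ds_bpermute(v, masked, arch)
    cross = ds_bpermute(permlane64(v, arch), masked, arch)
    # On GCN/CDNA 'same' and 'cross' are identical, so this select is a no-op.
    return [cross[l] if (targets[l] ^ l) & 32 else same[l] for l in range(WAVE)]


if __name__ == "__main__":
    v = list(range(WAVE))
    for arch in ("gfx942", "gfx1100", "gfx1030"):
        out = shuffle(v, [l ^ 32 for l in range(WAVE)], arch)
        print(arch, out[:2], "...", out[32:34], "...")
```

In this model all three targets return [32..63, 0..31] for an XOR-32 shuffle, and on the gfx9 path the two ds_bpermute results are equal lane for lane, which is what makes the select a no-op.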

RDNA paths are unchanged: gfx11/gfx12 keep native v_permlane64_b32, gfx10.x
keeps the LDS-roundtrip emulation, both with lane mask 31.

The subgroup docs described emitting v_permlane64_b32 on CDNA as
"well-defined and free", which is exactly the pre-fix behavior that
crashed the AMDGPU backend (genesis-world #2962). Document the actual
per-arch lowering: a single wave-wide ds_bpermute on CDNA (no permlane64)
and the permlane64 + ds_bpermute + select pairing on RDNA wave64.