[AMDGPU] Don't emit llvm.amdgcn.permlane64 on CDNA by hughperkins · Pull Request #746 · Genesis-Embodied-AI/quadrants

hughperkins · 2026-06-17T21:05:06Z

v_permlane64_b32 is an RDNA-only instruction: it exists on gfx11 (RDNA3) and
gfx12 (RDNA4) but on no CDNA part. quadrants previously enabled the
llvm.amdgcn.permlane64 intrinsic on gfx940/gfx941/gfx942 (CDNA3) as well. On
gfx942 the AMDGPU backend does not fail cleanly with "Cannot select" -- it
selects the V_PERMLANE64_B32 pseudo, which has no valid MC opcode for CDNA, and
then crashes with a bare SIGSEGV in SIInstrInfo::getInstSizeInBytes during the
branch-relaxation pass. This made scene.build() segfault for any kernel using a
cross-half subgroup shuffle (genesis-world #2962).

Gate has_permlane64 to gfx11/gfx12 only, so every CDNA target (and gfx10.x
RDNA1/2) takes the existing LDS-roundtrip software emulation, which produces
correct cross-half results on wave64 hardware. Also drop the
QD_AMDGPU_FORCE_PERMLANE64_FALLBACK env-var escape hatch and correct the
related comments in llvm_context.cpp, runtime.cpp and test_simt.py.

Issue: #
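The gating rule described above can be sketched in a few lines. This is only an illustration in Python, not the project's C++ code: `gfx_major` is a hypothetical helper, and the sketch assumes plain target names such as `gfx942` or `gfx1100` with no feature suffixes. Only the name `has_permlane64` and the gfx11/gfx12 rule come from the description.

```python
def gfx_major(arch: str) -> int:
    """Return the major generation of a plain gfx target name.

    Hypothetical helper: assumes names like "gfx942", "gfx90a" or "gfx1100",
    where the last two characters are the minor/stepping part.
    """
    if not arch.startswith("gfx") or len(arch) not in (6, 7):
        raise ValueError(f"unrecognized target name: {arch!r}")
    return int(arch[3:-2])


def has_permlane64(arch: str) -> bool:
    """True only for gfx11 (RDNA3) and gfx12 (RDNA4), as the PR describes."""
    return gfx_major(arch) in (11, 12)


if __name__ == "__main__":
    for name in ("gfx942", "gfx950", "gfx1030", "gfx1100", "gfx1200"):
        path = "native permlane64" if has_permlane64(name) else "fallback"
        print(f"{name}: {path}")
```

Under this rule every gfx9 target (CDNA) and every gfx10 target (RDNA1/2) reports no native permlane64.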

Brief Summary

copilot:summary

Walkthrough

copilot:walkthrough

Gating permlane64 off CDNA (prev commit) stops the SIGSEGV but exposed a latent correctness bug: the cross-half shuffle helper is RDNA-shaped. It masks ds_bpermute's target to 31 and relies on permlane64 to fetch the top half, which is correct only where ds_bpermute is SIMD32-scoped (RDNA). On CDNA ds_bpermute is wave64-wide, so masking to 31 means every lane reads a bottom-half lane and the top half is never reached -- shuffle_xor(v,32) returned [32..63, 32..63] instead of [32..63, 0..31] on gfx942 (CI never caught this: RDNA uses the native instruction, not this path).

Make the lowering architecture-aware via two JIT-patched knobs:

- amdgpu_ds_bpermute_lane_mask(): 63 on GCN/CDNA (gfx9xx, wave64-wide ds_bpermute), 31 on RDNA (SIMD32-scoped). With mask 63 a single wide ds_bpermute already returns lane target_lane for the whole wave.
- permlane64 patched to the identity on CDNA, so the helper's cross-SIMD branch equals the same-SIMD branch and the per-lane select is a true no-op (and the intrinsic, which has no MC opcode on CDNA, is never emitted).

RDNA paths are unchanged: gfx11/gfx12 keep native v_permlane64_b32, gfx10.x keeps the LDS-roundtrip emulation, both with lane mask 31.
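The CDNA behavior before and after this change can be reproduced with a small lane-level simulation. This is a sketch in Python under stated assumptions, not the runtime's code: the structure of `cross_half_shuffle` and its select condition (compare bit 5 of the target lane with bit 5 of the current lane) are inferred from the description above, and all function names here are hypothetical. Only the lane masks 31 and 63, the identity patch, and the two result vectors come from the commit message.

```python
WAVE = 64


def ds_bpermute_wave_wide(values, targets):
    """Model of a wave64-wide ds_bpermute: lane i reads values[targets[i]]."""
    return [values[t % WAVE] for t in targets]


def swap_halves(values):
    """Model of permlane64 or its LDS-roundtrip emulation: lane i reads lane i ^ 32."""
    return [values[lane ^ 32] for lane in range(WAVE)]


def identity(values):
    """Model of permlane64 patched to the identity (the CDNA fix)."""
    return list(values)


def cross_half_shuffle(values, targets, lane_mask, permlane64):
    """Assumed helper shape: two bpermutes plus a per-lane select."""
    masked = [t & lane_mask for t in targets]
    same = ds_bpermute_wave_wide(values, masked)
    cross = ds_bpermute_wave_wide(permlane64(values), masked)
    return [
        same[lane] if (targets[lane] & 32) == (lane & 32) else cross[lane]
        for lane in range(WAVE)
    ]


def shuffle_xor(values, xor_mask, lane_mask, permlane64):
    targets = [lane ^ xor_mask for lane in range(WAVE)]
    return cross_half_shuffle(values, targets, lane_mask, permlane64)


v = list(range(WAVE))
# RDNA-shaped helper on wave-wide hardware: mask 31 plus a half swap.
before = shuffle_xor(v, 32, lane_mask=31, permlane64=swap_halves)
# Architecture-aware knobs for CDNA: mask 63 plus identity permlane64.
after = shuffle_xor(v, 32, lane_mask=63, permlane64=identity)

if __name__ == "__main__":
    print(before[:4], before[32:36])  # [32, 33, 34, 35] [32, 33, 34, 35]
    print(after[:4], after[32:36])    # [32, 33, 34, 35] [0, 1, 2, 3]
```

With mask 63 and an identity permlane64, both branches of the select hold the same value, so the select cannot pick a wrong lane regardless of its condition.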

The subgroup docs described emitting v_permlane64_b32 on CDNA as "well-defined and free", which is exactly the pre-fix behavior that crashed the AMDGPU backend (genesis-world #2962). Document the actual per-arch lowering: a single wave-wide ds_bpermute on CDNA (no permlane64) and the permlane64 + ds_bpermute + select pairing on RDNA wave64.
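The RDNA wave64 pairing mentioned above can also be checked with a lane-level model. The sketch below is in Python and rests on assumptions: it models ds_bpermute on RDNA as reading only within the calling lane's own 32-lane half, models permlane64 as a swap of the two halves, and uses a select that compares bit 5 of the target with bit 5 of the current lane. The function names are hypothetical and are not taken from the project's sources.

```python
WAVE = 64
HALF = 32


def ds_bpermute_simd32(values, targets):
    """Assumed RDNA model: each lane reads from its own 32-lane half only."""
    return [
        values[(lane & HALF) | (targets[lane] & (HALF - 1))]
        for lane in range(WAVE)
    ]


def permlane64(values):
    """Assumed model of v_permlane64_b32: swap the two halves of the wave."""
    return [values[lane ^ HALF] for lane in range(WAVE)]


def shuffle_rdna_wave64(values, targets):
    """permlane64 + ds_bpermute + select pairing for an arbitrary target lane."""
    same = ds_bpermute_simd32(values, targets)
    cross = ds_bpermute_simd32(permlane64(values), targets)
    return [
        same[lane] if (targets[lane] & HALF) == (lane & HALF) else cross[lane]
        for lane in range(WAVE)
    ]


def shuffle_cdna(values, targets):
    """Single wave-wide ds_bpermute with lane mask 63, no permlane64."""
    return [values[t & (WAVE - 1)] for t in targets]


if __name__ == "__main__":
    data = [10 * lane for lane in range(WAVE)]
    patterns = {
        "xor 32": [lane ^ 32 for lane in range(WAVE)],
        "rotate by 5": [(lane + 5) % WAVE for lane in range(WAVE)],
        "broadcast lane 40": [40] * WAVE,
    }
    for name, targets in patterns.items():
        agree = shuffle_rdna_wave64(data, targets) == shuffle_cdna(data, targets)
        print(f"{name}: lowerings agree = {agree}")
```

In this model the two lowerings return the same result for every target pattern, which is the property the per-arch documentation needs to state.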

github-actions · 2026-06-17T21:40:11Z

Total: 3 file(s) changed, +23 -8 code lines.

github-actions · 2026-06-17T22:43:42Z

Diff coverage: 0% · 0 lines, 0 missing

github-actions · 2026-06-23T13:14:32Z

Total: 3 file(s) changed, +23 -8 code lines.

github-actions · 2026-06-23T14:18:27Z

Diff coverage: 0% · 0 lines, 0 missing

hughperkins added 4 commits June 17, 2026 12:24

[Misc] AMDGPU: clang-format reflow of permlane64 emulation comment

cda264c

hughperkins mentioned this pull request Jun 17, 2026

[Bug]: Segfault (SIGSEGV) in scene.build() with an articulated robot on AMD gfx942 (CDNA3 / MI300-class), works on gfx950 (CDNA4 / MI350) Genesis-Embodied-AI/genesis-world#2962

Open

Merge branch 'main' into hp/amdgpu-no-permlane64-on-cdna

406a932

hughperkins wants to merge 5 commits into main from hp/amdgpu-no-permlane64-on-cdna
