[AMDGPU] Fix missing VA_VDST wait for XDL/WMMA source registers in GFX12 expert scheduling mode 2 #1793

Open

BoyuZhang-amd wants to merge 1 commit into ROCm:amd-staging from BoyuZhang-amd:fix-wmma-hazard-gfx12

Conversation

@BoyuZhang-amd

Summary

This patch fixes a data hazard bug in GFX12 expert scheduling mode 2 where the compiler fails to insert s_wait_alu depctr_va_vdst(0) when a load instruction overwrites WMMA/XDL source registers before the WMMA instruction has finished reading them.

Problem

In expert scheduling mode (-mllvm -amdgpu-expert-scheduling-mode=1), the VA_VDST counter only tracks VALU destination registers, not source registers. For XDL/WMMA instructions that execute asynchronously, the source registers are read over multiple cycles. If a subsequent load instruction overwrites these source registers before the WMMA completes, a data race occurs.

Example of the bug

v_wmma_f32_16x16x16_f16 v[1:8], v[97:100], v[177:180], v[1:8]
; Missing: s_wait_alu depctr_va_vdst(0)
ds_load_b128 v[97:100], v197 offset:56352  ; Overwrites v[97:100] while WMMA reads it!

The ds_load_b128 overwrites v[97:100] before the WMMA instruction has finished reading from it, causing incorrect computation results that are non-deterministic.

Root Cause

In WaitcntBrackets::updateByEvent(), when processing VA_VDST events, the code only sets scores for destination registers (Op.isDef()), skipping all source registers (Op.isUse()):

if (!Op.isReg() || (T == VA_VDST && Op.isUse()) ||  // Skips all uses!
    (T == VM_VSRC && Op.isDef()))
  continue;

This means WMMA source registers like v[97:100] have no VA_VDST score, so when the subsequent ds_load wants to overwrite them, generateWaitcntInstBefore() doesn't detect a dependency and doesn't insert the required wait.

Proposed Fix

This patch modifies the VA_VDST handling to also track source registers for XDL/WMMA instructions, since these instructions execute asynchronously and their source registers must not be overwritten until completion:

if (T == VA_VDST) {
  // Track destination register writes (all VALU)
  if (Op.isDef()) {
    setScoreByOperand(Op, T, CurrScore);
  }
  // Track XDL/WMMA source register reads (async execution hazard)
  else if (Op.isUse() && Context->TII->isXDL(Inst)) {
    setScoreByOperand(Op, T, CurrScore);
  }
}

This is a minimal fix that addresses the issue. Open to suggestions if there's a better approach - for example, whether this logic should be handled differently or if there are other cases that need similar treatment.
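As a sanity check on the reasoning above, here is a toy model (hypothetical names, not the actual WaitcntBrackets code) contrasting the old defs-only bookkeeping with the proposed defs-plus-XDL-sources bookkeeping:

```cpp
#include <array>
#include <cstdint>
#include <initializer_list>

// Toy per-VGPR VA_VDST score table. A nonzero score means there is a
// pending VALU access that a later write to that register must wait on.
constexpr int kNumVGPRs = 256;

struct ToyBrackets {
  std::array<uint32_t, kNumVGPRs> VaVdstScore{};  // 0 == nothing pending
  uint32_t CurrScore = 0;

  // Record one VALU/WMMA instruction. Defs always get a score; source
  // registers are scored only when trackXDLUses is set (the proposed fix).
  void recordVALU(std::initializer_list<int> defs,
                  std::initializer_list<int> uses, bool trackXDLUses) {
    ++CurrScore;
    for (int r : defs) VaVdstScore[r] = CurrScore;
    if (trackXDLUses)
      for (int r : uses) VaVdstScore[r] = CurrScore;
  }

  // generateWaitcntInstBefore-style check: does an instruction that
  // writes these registers need a va_vdst wait first?
  bool needsWaitBeforeWrite(std::initializer_list<int> writes) const {
    for (int r : writes)
      if (VaVdstScore[r] != 0) return true;
    return false;
  }
};
```

With trackXDLUses off, the WMMA's source range v[97:100] carries no score, so the ds_load that overwrites it is not flagged; with sources tracked, the WAR overwrite is detected and a wait would be inserted.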

Testing

  • Tested with a minimal reproducer that triggers the bug 100% of the time without the fix
  • After the fix, the reproducer produces deterministic correct results
  • The fix correctly inserts s_wait_alu depctr_va_vdst(0) before loads that overwrite WMMA source registers

Reproducer

// Minimal WMMA hazard reproducer
// Pattern: WMMA reads fragV, then ds_load overwrites fragV
// Bug: missing s_wait_alu depctr_va_vdst(0) causes data corruption

#define V_LDS(t, c) (*(half8_t*)(V_shared + (t) * WMMA_K * ld_v + (c) + v_off))
#define WMMA_OI(t) Oi[t] = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(fragP, fragV[t], Oi[t])

for (int c = 0; c < 5; ++c) {
    const int nc = (c + 1) * WMMA_K;
    
    // WMMA reads fragV[0], fragV[1] asynchronously
    __builtin_amdgcn_s_setprio(0);
    WMMA_OI(0); WMMA_OI(1);
    __builtin_amdgcn_s_setprio(0);
    // BUG: ds_load overwrites fragV[0], fragV[1] while WMMA still reading
    // Missing: s_wait_alu depctr_va_vdst(0)
    if (c < 4) { fragV[0] = V_LDS(0, nc); fragV[1] = V_LDS(1, nc); }
    
    __builtin_amdgcn_s_setprio(0);
    WMMA_OI(2); WMMA_OI(3);
    __builtin_amdgcn_s_setprio(0);
    if (c < 4) { fragV[2] = V_LDS(2, nc); fragV[3] = V_LDS(3, nc); }
    
    // ... repeat for fragV[4:7]
}

Generated assembly showing the bug:

v_wmma_f32_16x16x16_f16 v[41:48], v[65:68], v[72:75], v[41:48]
v_wmma_f32_16x16x16_f16 v[57:64], v[65:68], v[76:79], v[57:64]
s_setprio 0
; Missing: s_wait_alu depctr_va_vdst(0)
ds_load_b128 v[72:75], v71 offset:32    ; <-- Overwrites v[72:75] while WMMA reads!
ds_load_b128 v[76:79], v71 offset:2592  ; <-- Overwrites v[76:79] while WMMA reads!

Impact

  • Affected targets: GFX12+ with expert scheduling mode enabled
  • Affected instructions: WMMA, SWMMAC, DOT (all XDL instructions)
  • No performance regression expected: The fix only adds waits where they are actually needed to prevent data hazards

Notes

I'm relatively new to the AMDGPU backend, so I'd appreciate any feedback on whether this is the right place/approach for this fix, or if there are edge cases I might have missed.


@jayfoad

jayfoad commented Mar 18, 2026

Thanks for working on this!

The fix should really go upstream.

Please add a lit test case, maybe in test/CodeGen/AMDGPU/expert_scheduling_gfx12.mir.

For XDL/WMMA instructions that execute asynchronously, the source registers are read over multiple cycles. If a subsequent load instruction overwrites these source registers before the WMMA completes, a data race occurs.

I did not know about this. Is there any documentation that confirms it, e.g. in the Shader Programming Guide?

@jayfoad

jayfoad commented Mar 18, 2026

Cc @stepthomas


Minor stylistic nit, but these braces are unnecessary.


These braces too.

@stepthomas

This is news to me too. This looks sufficient to me, but as Jay says it's better to go in upstream LLVM.

@BoyuZhang-amd force-pushed the fix-wmma-hazard-gfx12 branch from 21580a8 to 3332dce on March 19, 2026

@BoyuZhang-amd
Author

Hi @jayfoad @stepthomas ,

Thanks for the review feedback! I've removed unnecessary braces and added a lit test case in test/CodeGen/AMDGPU/expert_scheduling_gfx12.mir.

Regarding documentation: sorry, I couldn't find any public technical documentation that describes expert scheduling mode. However, I came across some internal discussions from 2024 where you both mentioned:

  1. On GFX12+, WMMA instructions may complete out-of-order relative to other VALU instructions
  2. Multiple WMMA operations can be scheduled back-to-back, allowing other independent instructions to be issued while they are running

I discovered this problem while enabling expert scheduling mode for a kernel I was working on, which produced computation errors I couldn't explain. After applying this patch, the kernel produces correct and consistent results.

I suspect you both have a deeper understanding of the underlying hardware behavior than I do, since my investigation was largely guided by your previous discussions. I'd appreciate any feedback on whether this approach is correct, or if there are additional edge cases I should consider.

@jayfoad

jayfoad commented Mar 19, 2026

I've looked into this some more. You are guarding against a WAR dependency, VALU <- VGPR <- VMEM. This is actually mentioned in the documentation but it seems that we never implemented it. I guess it was never required for "normal" fast VALU instructions that read all of their inputs in the first cycle.

As for the implementation:

  • I am not sure if this is only required for XDL/WMMA instructions. What about other slow or multi-cycle instructions? Should we do it for all instructions, except for possibly a list of known "fast" instructions?
  • I think your patch would also insert a wait for a "read after read" dependency, VALU <- VGPR -> VMEM, which we really do not want. I'm not sure how to avoid this.

@BoyuZhang-amd
Author

Hi @jayfoad @stepthomas ,

Thank you for the feedback on the WAR and RAR concern. I've been investigating further and made an interesting discovery.

I compared hipBLASLt's handling of expert scheduling mode 2 with the compiler's approach. Interestingly, hipBLASLt has additional s_wait_alu depctr_va_vdst(0) insertions that the compiler doesn't have.

Specifically, hipBLASLt inserts va_vdst waits after WMMA/MFMA instructions to ensure the destination register writes complete before subsequent instructions read from them. The pattern is:

v_wmma_f32_16x16x16_f16 v[acc], v[A], v[B], v[acc]
s_wait_alu depctr_va_vdst(0) ; wait for acc write to complete
; ... subsequent instructions that read v[acc]. can be another v_wmma_f32_16x16x16_f16 v[acc], ... , v[acc] ...

This is a VMMA (special VALU) -> VGPR -> VALU RAW dependency. According to a comment in the compiler, this case is managed by hardware in expert scheduling mode, but hipBLASLt treats it specially, as if it should be managed by the compiler.

This is a different hazard from what I originally described (VALU <- VGPR <- VMEM WAR dependency). Both issues result in the same symptom (needing more va_vdst waits to produce correct results), but they address different dependency types.

With a different way of thinking, I'd like to try adding the same handling as hipBLASLt to the compiler to see if this also fixes the computation errors in my kernel. This may take some time, but I'll update this PR with any progress!

@jayfoad

jayfoad commented Mar 20, 2026

v_wmma_f32_16x16x16_f16 v[acc], v[A], v[B], v[acc]
s_wait_alu depctr_va_vdst(0) ; wait for acc write to complete
; ... subsequent instructions that read v[acc] ...

This is a VMMA (special VALU) -> VGPR -> VALU RAW dependency case, which according to the comment in compiler will be managed by hardware in expert scheduling mode, but specially treated in hipBLASLt as it should be managed by compiler.

This case really should not require a wait on va_vdst, unless there is some undocumented hardware problem.

@BoyuZhang-amd
Author

You're right about the RAW dependency case. After further investigation, I found that the va_vdst insertions in hipBLASLt after WMMA instructions are more like a safety measure at the end of each iteration, rather than addressing a specific RAW hazard as I initially speculated.

However, while investigating deeper, I found an internal ticket (DEGFXMI400-11512) that describes an issue highly consistent with my original findings. It's specifically about WAR dependencies: "before issuing DS loads, we need to wait for WMMA to complete reading the sources". The solution proposed in that ticket is also to insert va_vdst waits to prevent data hazards, which is exactly what this PR aims to do.

Although adding these va_vdst waits will have a noticeable performance impact, I believe it's necessary at the compiler level to prevent data hazards and ensure correctness.

Regarding the RAR concern you mentioned, I've analyzed the code more carefully:

The current design has two phases:

  • Phase 1 - score setting (updateByEvent): decides which registers to track.
  • Phase 2 - wait insertion (generateWaitcntInstBefore): checks all VGPR operands (both def and use) against pending scores, then clears the VA_VDST wait if the instruction is VALU.

My patch sets scores for XDL source registers in phase 1. Since phase 2 checks all operands, this would indeed cause unexpected RAR cases like:

v_wmma v[acc], v[A], v[B], v[acc] ; sets score on v[A]
ds_load v[X], v[A]                ; v[A] used as an address (a read) - would incorrectly trigger a wait

Avoiding this is not easy. XDL source registers need special treatment, and implementing it properly would require approaches like:

  1. Modifying the data structure (adding a bitset?) to mark which scores come from XDL source registers.
  2. Splitting the current score into separate "dest score" and "source score" fields and adding conditional logic based on that.

Both approaches would involve significant changes to the existing code.
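For what it's worth, option 2 can be sketched in a toy model (hypothetical names, not the actual WaitcntBrackets data structures). Keeping separate def and use scores lets a later write check both (catching WAW and WAR), while a later read checks only def scores, so read-after-read never triggers a wait:

```cpp
#include <array>
#include <cstdint>
#include <initializer_list>

constexpr int kNumVGPRs = 256;

struct SplitScores {
  std::array<uint32_t, kNumVGPRs> DefScore{};  // pending VALU/XDL writes
  std::array<uint32_t, kNumVGPRs> UseScore{};  // pending XDL source reads
  uint32_t CurrScore = 0;

  // Record an XDL/WMMA instruction: score its defs and its source reads
  // in separate tables.
  void recordXDL(std::initializer_list<int> defs,
                 std::initializer_list<int> srcs) {
    ++CurrScore;
    for (int r : defs) DefScore[r] = CurrScore;
    for (int r : srcs) UseScore[r] = CurrScore;
  }

  // A later write (e.g. a ds_load destination) conflicts with pending
  // writes (WAW) and with pending XDL source reads (WAR).
  bool writeNeedsWait(int r) const { return DefScore[r] || UseScore[r]; }

  // A later read (e.g. a ds_load address operand) conflicts only with
  // pending writes (RAW); RAR needs no wait.
  bool readNeedsWait(int r) const { return DefScore[r] != 0; }
};
```

Applied to the earlier example: after `v_wmma v[acc], v[A], v[B], v[acc]`, a ds_load overwriting v[A] is flagged (WAR), a ds_load merely using v[A] as an address is not (RAR), and a read of v[acc] is flagged (RAW).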
