[AMDGPU] Fix missing VA_VDST wait for XDL/WMMA source registers in GFX12 expert scheduling mode 2 #1793

Open

BoyuZhang-amd wants to merge 1 commit into ROCm:amd-staging from BoyuZhang-amd:fix-wmma-hazard-gfx12

Conversation

@BoyuZhang-amd

Summary

This patch fixes a data hazard bug in GFX12 expert scheduling mode 2 where the compiler fails to insert s_wait_alu depctr_va_vdst(0) when a load instruction overwrites WMMA/XDL source registers before the WMMA instruction has finished reading them.

Problem

In expert scheduling mode (-mllvm -amdgpu-expert-scheduling-mode=1), the VA_VDST counter only tracks VALU destination registers, not source registers. For XDL/WMMA instructions that execute asynchronously, the source registers are read over multiple cycles. If a subsequent load instruction overwrites these source registers before the WMMA completes, a data race occurs.

Example of the bug

v_wmma_f32_16x16x16_f16 v[1:8], v[97:100], v[177:180], v[1:8]
; Missing: s_wait_alu depctr_va_vdst(0)
ds_load_b128 v[97:100], v197 offset:56352  ; Overwrites v[97:100] while WMMA reads it!

The ds_load_b128 overwrites v[97:100] before the WMMA instruction has finished reading from it, causing incorrect computation results that are non-deterministic.

Root Cause

In WaitcntBrackets::updateByEvent(), when processing VA_VDST events, the code only sets scores for destination registers (Op.isDef()), skipping all source registers (Op.isUse()):

if (!Op.isReg() || (T == VA_VDST && Op.isUse()) ||  // Skips all uses!
    (T == VM_VSRC && Op.isDef()))
  continue;

This means WMMA source registers like v[97:100] have no VA_VDST score, so when the subsequent ds_load wants to overwrite them, generateWaitcntInstBefore() doesn't detect a dependency and doesn't insert the required wait.

Proposed Fix

This patch modifies the VA_VDST handling to also track source registers for XDL/WMMA instructions, since these instructions execute asynchronously and their source registers must not be overwritten until completion:

if (T == VA_VDST) {
  // Track destination register writes (all VALU)
  if (Op.isDef()) {
    setScoreByOperand(Op, T, CurrScore);
  }
  // Track XDL/WMMA source register reads (async execution hazard)
  else if (Op.isUse() && Context->TII->isXDL(Inst)) {
    setScoreByOperand(Op, T, CurrScore);
  }
}

This is a minimal fix that addresses the issue. Open to suggestions if there's a better approach - for example, whether this logic should be handled differently or if there are other cases that need similar treatment.
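As a sanity check on the reasoning above, here is a toy model (hypothetical names, not the actual WaitcntBrackets code) contrasting the old defs-only bookkeeping with the proposed defs-plus-XDL-sources bookkeeping:

```cpp
#include <array>
#include <cstdint>
#include <initializer_list>

// Toy per-VGPR VA_VDST score table. A nonzero score means there is a
// pending VALU access that a later write to that register must wait on.
constexpr int kNumVGPRs = 256;

struct ToyBrackets {
  std::array<uint32_t, kNumVGPRs> VaVdstScore{};  // 0 == nothing pending
  uint32_t CurrScore = 0;

  // Record one VALU/WMMA instruction. Defs always get a score; source
  // registers are scored only when trackXDLUses is set (the proposed fix).
  void recordVALU(std::initializer_list<int> defs,
                  std::initializer_list<int> uses, bool trackXDLUses) {
    ++CurrScore;
    for (int r : defs) VaVdstScore[r] = CurrScore;
    if (trackXDLUses)
      for (int r : uses) VaVdstScore[r] = CurrScore;
  }

  // generateWaitcntInstBefore-style check: does an instruction that
  // writes these registers need a va_vdst wait first?
  bool needsWaitBeforeWrite(std::initializer_list<int> writes) const {
    for (int r : writes)
      if (VaVdstScore[r] != 0) return true;
    return false;
  }
};
```

With trackXDLUses off, the WMMA's source range v[97:100] carries no score, so the ds_load that overwrites it is not flagged; with sources tracked, the WAR overwrite is detected and a wait would be inserted.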

Testing

  • Tested with a minimal reproducer that triggers the bug 100% of the time without the fix
  • After the fix, the reproducer produces deterministic correct results
  • The fix correctly inserts s_wait_alu depctr_va_vdst(0) before loads that overwrite WMMA source registers

Reproducer

// Minimal WMMA hazard reproducer
// Pattern: WMMA reads fragV, then ds_load overwrites fragV
// Bug: missing s_wait_alu depctr_va_vdst(0) causes data corruption

#define V_LDS(t, c) (*(half8_t*)(V_shared + (t) * WMMA_K * ld_v + (c) + v_off))
#define WMMA_OI(t) Oi[t] = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32_gfx12(fragP, fragV[t], Oi[t])

for (int c = 0; c < 5; ++c) {
    const int nc = (c + 1) * WMMA_K;
    
    // WMMA reads fragV[0], fragV[1] asynchronously
    __builtin_amdgcn_s_setprio(0);
    WMMA_OI(0); WMMA_OI(1);
    __builtin_amdgcn_s_setprio(0);
    // BUG: ds_load overwrites fragV[0], fragV[1] while WMMA still reading
    // Missing: s_wait_alu depctr_va_vdst(0)
    if (c < 4) { fragV[0] = V_LDS(0, nc); fragV[1] = V_LDS(1, nc); }
    
    __builtin_amdgcn_s_setprio(0);
    WMMA_OI(2); WMMA_OI(3);
    __builtin_amdgcn_s_setprio(0);
    if (c < 4) { fragV[2] = V_LDS(2, nc); fragV[3] = V_LDS(3, nc); }
    
    // ... repeat for fragV[4:7]
}

Generated assembly showing the bug:

v_wmma_f32_16x16x16_f16 v[41:48], v[65:68], v[72:75], v[41:48]
v_wmma_f32_16x16x16_f16 v[57:64], v[65:68], v[76:79], v[57:64]
s_setprio 0
; Missing: s_wait_alu depctr_va_vdst(0)
ds_load_b128 v[72:75], v71 offset:32    ; <-- Overwrites v[72:75] while WMMA reads!
ds_load_b128 v[76:79], v71 offset:2592  ; <-- Overwrites v[76:79] while WMMA reads!

Impact

  • Affected targets: GFX12+ with expert scheduling mode enabled
  • Affected instructions: WMMA, SWMMAC, DOT (all XDL instructions)
  • No performance regression expected: The fix only adds waits where they are actually needed to prevent data hazards

Notes

I'm relatively new to the AMDGPU backend, so I'd appreciate any feedback on whether this is the right place/approach for this fix, or if there are edge cases I might have missed.


@jayfoad

jayfoad commented Mar 18, 2026

Thanks for working on this!

The fix should really go upstream.

Please add a lit test case, maybe in test/CodeGen/AMDGPU/expert_scheduling_gfx12.mir.

For XDL/WMMA instructions that execute asynchronously, the source registers are read over multiple cycles. If a subsequent load instruction overwrites these source registers before the WMMA completes, a data race occurs.

I did not know about this. Is there any documentation that confirms it, e.g. in the Shader Programming Guide?

@jayfoad

jayfoad commented Mar 18, 2026

Cc @stepthomas


Minor stylistic nit, but these braces are unnecessary.


These braces too.

@stepthomas

This is news to me too. This looks sufficient to me, but as Jay says it's better to go in upstream LLVM.

@BoyuZhang-amd force-pushed the fix-wmma-hazard-gfx12 branch from 21580a8 to 3332dce on March 19, 2026

@BoyuZhang-amd
Author

Hi @jayfoad @stepthomas ,

Thanks for the review feedback! I've removed unnecessary braces and added a lit test case in test/CodeGen/AMDGPU/expert_scheduling_gfx12.mir.

Regarding documentation: sorry, I couldn't find any public technical documentation that describes expert scheduling mode. However, I came across some internal discussions from 2024 where you both mentioned:

  1. On GFX12+, WMMA instructions may complete out-of-order relative to other VALU instructions
  2. Multiple WMMA operations can be scheduled back-to-back, allowing other independent instructions to be issued while they are running

I discovered this problem while enabling expert scheduling mode for a kernel I was working on, which produced computation errors I couldn't explain. After applying this patch, the kernel produces correct and consistent results.

I suspect you both have a deeper understanding of the underlying hardware behavior than I do, since my investigation was largely guided by your previous discussions. I'd appreciate any feedback on whether this approach is correct, or if there are additional edge cases I should consider.

@jayfoad

jayfoad commented Mar 19, 2026

I've looked into this some more. You are guarding against a WAR dependency, VALU <- VGPR <- VMEM. This is actually mentioned in the documentation but it seems that we never implemented it. I guess it was never required for "normal" fast VALU instructions that read all of their inputs in the first cycle.

As for the implementation:

  • I am not sure if this is only required for XDL/WMMA instructions. What about other slow or multi-cycle instructions? Should we do it for all instructions, except for possibly a list of known "fast" instructions?
  • I think your patch would also insert a wait for a "read after read" dependency, VALU <- VGPR -> VMEM, which we really do not want. I'm not sure how to avoid this.

@BoyuZhang-amd
Author

Hi @jayfoad @stepthomas ,

Thank you for the feedback on the WAR and RAR concern. I've been investigating further and made an interesting discovery.

I compared hipBLASLt's handling of expert scheduling mode 2 with the compiler's approach. Interestingly, hipBLASLt has additional s_wait_alu depctr_va_vdst(0) insertions that the compiler doesn't have.

Specifically, hipBLASLt inserts va_vdst waits after WMMA/MFMA instructions to ensure the destination register writes complete before subsequent instructions read from them. The pattern is:

v_wmma_f32_16x16x16_f16 v[acc], v[A], v[B], v[acc]
s_wait_alu depctr_va_vdst(0) ; wait for acc write to complete
; ... subsequent instructions that read v[acc]. can be another v_wmma_f32_16x16x16_f16 v[acc], ... , v[acc] ...

This is a VMMA (special VALU) -> VGPR -> VALU RAW dependency. According to a comment in the compiler, this case is managed by hardware in expert scheduling mode, but hipBLASLt treats it specially, as if it should be managed by the compiler.

This is a different hazard from what I originally described (VALU <- VGPR <- VMEM WAR dependency). Both issues result in the same symptom (needing more va_vdst waits to produce correct results), but they address different dependency types.

With a different way of thinking, I'd like to try adding the same handling as hipBLASLt to the compiler to see if this also fixes the computation errors in my kernel. This may take some time, but I'll update this PR with any progress!

@jayfoad

jayfoad commented Mar 20, 2026

v_wmma_f32_16x16x16_f16 v[acc], v[A], v[B], v[acc]
s_wait_alu depctr_va_vdst(0) ; wait for acc write to complete
; ... subsequent instructions that read v[acc] ...

This is a VMMA (special VALU) -> VGPR -> VALU RAW dependency case, which according to the comment in compiler will be managed by hardware in expert scheduling mode, but specially treated in hipBLASLt as it should be managed by compiler.

This case really should not require a wait on va_vdst, unless there is some undocumented hardware problem.

@BoyuZhang-amd
Author

You're right about the RAW dependency case. After further investigation, I found that the va_vdst insertions in hipBLASLt after WMMA instructions are more like a safety measure at the end of each iteration, rather than addressing a specific RAW hazard as I initially speculated.

However, while investigating deeper, I found an internal ticket (DEGFXMI400-11512) that describes an issue highly consistent with my original findings. It's specifically about WAR dependencies: "before issuing DS loads, we need to wait for WMMA to complete reading the sources". The solution proposed in that ticket is also to insert va_vdst waits to prevent data hazards, which is exactly what this PR aims to do.

Although adding these va_vdst waits will have a noticeable performance impact, I believe it's necessary at the compiler level to prevent data hazards and ensure correctness.

Regarding the RAR concern you mentioned, I've analyzed the code more carefully:

The current design has two phases:

  • Phase 1 - score setting (updateByEvent): decides which registers to track.
  • Phase 2 - wait insertion (generateWaitcntInstBefore): checks all VGPR operands (both def and use) against pending scores, then clears the VA_VDST wait if the instruction is VALU.

My patch sets scores for XDL source registers in phase 1. Since phase 2 checks all operands, this would indeed cause unexpected RAR cases like:

v_wmma v[acc], v[A], v[B], v[acc] ; sets score on v[A]
ds_load v[X], v[A]                ; v[A] used as an address (a read) - would incorrectly trigger a wait

Avoiding this is not easy. XDL source registers need special treatment, and implementing it properly would require approaches like:

  1. Modifying the data structure (adding a bitset?) to mark which scores come from XDL source registers.
  2. Splitting the current score into separate "dest score" and "source score" fields and adding conditional logic based on that.

Both approaches would involve significant changes to the existing code.
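For what it's worth, option 2 can be sketched in a toy model (hypothetical names, not the actual WaitcntBrackets data structures). Keeping separate def and use scores lets a later write check both (catching WAW and WAR), while a later read checks only def scores, so read-after-read never triggers a wait:

```cpp
#include <array>
#include <cstdint>
#include <initializer_list>

constexpr int kNumVGPRs = 256;

struct SplitScores {
  std::array<uint32_t, kNumVGPRs> DefScore{};  // pending VALU/XDL writes
  std::array<uint32_t, kNumVGPRs> UseScore{};  // pending XDL source reads
  uint32_t CurrScore = 0;

  // Record an XDL/WMMA instruction: score its defs and its source reads
  // in separate tables.
  void recordXDL(std::initializer_list<int> defs,
                 std::initializer_list<int> srcs) {
    ++CurrScore;
    for (int r : defs) DefScore[r] = CurrScore;
    for (int r : srcs) UseScore[r] = CurrScore;
  }

  // A later write (e.g. a ds_load destination) conflicts with pending
  // writes (WAW) and with pending XDL source reads (WAR).
  bool writeNeedsWait(int r) const { return DefScore[r] || UseScore[r]; }

  // A later read (e.g. a ds_load address operand) conflicts only with
  // pending writes (RAW); RAR needs no wait.
  bool readNeedsWait(int r) const { return DefScore[r] != 0; }
};
```

Applied to the earlier example: after `v_wmma v[acc], v[A], v[B], v[acc]`, a ds_load overwriting v[A] is flagged (WAR), a ds_load merely using v[A] as an address is not (RAR), and a read of v[acc] is flagged (RAW).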
