[AMDGPU] Fix missing VA_VDST wait for XDL/WMMA source registers in GFX12 expert scheduling mode 2#1793
Conversation
|
Thanks for working on this! The fix should really go upstream. Please add a lit test case, maybe in
I did not know about this. Is there any documentation that confirms it, e.g. in the Shader Programming Guide? |
|
Cc @stepthomas |
There was a problem hiding this comment.
Minor stylistic nit, but these braces are unnecessary.
|
This is news to me too. This looks sufficient to me, but as Jay says it's better to go in upstream LLVM. |
21580a8 to
3332dce
Compare
|
Hi @jayfoad @stepthomas , Thanks for the review feedback! I've removed unnecessary braces and added a lit test case in test/CodeGen/AMDGPU/expert_scheduling_gfx12.mir. Regarding documentation: Sorry I couldn't find any public technical documentation that describes about expert scheduling mode. However, I came across some internal discussions from 2024 where you both mentioned:
I discovered this problem while enabling expert scheduling mode for a kernel I was working on - computation errors that I can't explain. And after applying this patch, my kernel produces correct and consistent results. I suspect you both have a deeper understanding of the underlying hardware behavior than I do, since my investigation was largely guided by your previous discussions. I'd appreciate any feedback on whether this approach is correct, or if there are additional edge cases I should consider. |
|
I've looked into this some more. You are guarding against a WAR dependency, VALU <- VGPR <- VMEM. This is actually mentioned in the documentation but it seems that we never implemented it. I guess it was never required for "normal" fast VALU instructions that read all of their inputs in the first cycle. As for the implementation:
|
|
Hi @jayfoad @stepthomas , Thank you for the feedback on the WAR and RAR concern. I've been investigating further and made an interesting discovery. I compared hipBLASLt's handling of expert scheduling mode 2 with the compiler's approach. Interestingly, hipBLASLt has additional s_wait_alu depctr_va_vdst(0) insertions that the compiler doesn't have. Specifically, hipBLASLt inserts va_vdst waits after WMMA/MFMA instructions to ensure the destination register writes complete before subsequent instructions read from them. The pattern is: v_wmma_f32_16x16x16_f16 v[acc], v[A], v[B], v[acc] This is a different hazard from what I originally described (VALU <- VGPR <- VMEM WAR dependency). Both issues result in the same symptom (needing more va_vdst waits to produce correct results), but they address different dependency types. With a different way of thinking, I'd like to try adding the same handling as hipBLASLt to the compiler to see if this also fixes the computation errors in my kernel. This may take some time, but I'll update this PR with any progress! |
This case really should not require a wait on va_vdst, unless there is some undocumented hardware problem. |
|
You're right about the RAW dependency case. After further investigation, I found that the va_vdst insertions in hipBLASLt after WMMA instructions are more like a safety measure at the end of each iteration, rather than addressing a specific RAW hazard as I initially speculated. However, while investigating deeper, I found an internal ticket (DEGFXMI400-11512) that describes an issue highly consistent with my original findings. It's specifically about WAR dependencies: "before issuing DS loads, we need to wait for WMMA to complete reading the sources". The solution proposed in that ticket is also to insert va_vdst waits to prevent data hazards. The same as this PR aims to. Although adding these va_vdst waits will have a noticeable performance impact, I believe it's necessary at the compiler level to prevent data hazards and ensure correctness. For the RAR concern you mentioned. I've analyzed the code more carefully: The current design has two phases: My patch sets scores for XDL source registers in phase 1. Since phase 2 checks all operands, this would indeed cause unexpected RAR cases like: Avoiding this is not easy. XDL source registers need special treatment, and implementing it properly would require approaches like:
|
Summary
This patch fixes a data hazard bug in GFX12 expert scheduling mode 2 where the compiler fails to insert
s_wait_alu depctr_va_vdst(0)when a load instruction overwrites WMMA/XDL source registers before the WMMA instruction has finished reading them.Problem
In expert scheduling mode (
-mllvm -amdgpu-expert-scheduling-mode=1), theVA_VDSTcounter only tracks VALU destination registers, not source registers. For XDL/WMMA instructions that execute asynchronously, the source registers are read over multiple cycles. If a subsequent load instruction overwrites these source registers before the WMMA completes, a data race occurs.Example of the bug
The
ds_load_b128overwritesv[97:100]before the WMMA instruction has finished reading from it, causing incorrect computation results that are non-deterministic.Root Cause
In
WaitcntBrackets::updateByEvent(), when processingVA_VDSTevents, the code only sets scores for destination registers (Op.isDef()), skipping all source registers (Op.isUse()):This means WMMA source registers like
v[97:100]have noVA_VDSTscore, so when the subsequentds_loadwants to overwrite them,generateWaitcntInstBefore()doesn't detect a dependency and doesn't insert the required wait.Proposed Fix
This patch modifies the
VA_VDSThandling to also track source registers for XDL/WMMA instructions, since these instructions execute asynchronously and their source registers must not be overwritten until completion:This is a minimal fix that addresses the issue. Open to suggestions if there's a better approach - for example, whether this logic should be handled differently or if there are other cases that need similar treatment.
Testing
s_wait_alu depctr_va_vdst(0)before loads that overwrite WMMA source registersReproducer
Generated assembly showing the bug:
Impact
Notes
I'm relatively new to the AMDGPU backend, so I'd appreciate any feedback on whether this is the right place/approach for this fix, or if there are edge cases I might have missed.