Fix FlyDSL split-k barrier synchronization issue #2877

Open
XiaobingSuper wants to merge 4 commits into main from xiaobing/hgemm_split_fix

Conversation

@XiaobingSuper
Contributor

@XiaobingSuper XiaobingSuper commented Apr 23, 2026

Summary

Fix FlyDSL split-k barrier synchronization issue for graph mode

@XiaobingSuper XiaobingSuper requested review from a team and Copilot April 23, 2026 06:15
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-355 | Run Triton tests on MI355 in addition to MI300X |
| ci:sglang | SGLang integration tests |
| ci:atom | ATOM benchmark (DeepSeek-R1 + GPT-OSS) |
| ci:vllm | vLLM benchmark |
| ci:all | All of the above |

Add labels via the sidebar or gh pr edit 2877 --add-label <label>

Contributor

Copilot AI left a comment


Pull request overview

Updates FlyDSL’s split-k HGEMM synchronization to use a per-dispatch 64-bit token protocol (based on dispatch_id) to remain correct under graph replay while reducing host-side state and polling overhead.
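To make the token idea concrete, here is a minimal host-side sketch in plain Python. It is not the FlyDSL kernel code: `TokenSemaphore`, `dispatch_token`, and `run_dispatches` are hypothetical names, and a `threading.Condition` stands in for the release/acquire publish and poll that the PR performs on device memory. It only illustrates why a unique per-dispatch token lets one slot be reused without a reset, assuming the consumer compares the slot for equality with the current token.

```python
import threading

INVALID_TOKEN = 0  # 0 is reserved, so a zero-initialized slot never matches


def dispatch_token(dispatch_id):
    """Map a dispatch id to its token; the +1 keeps 0 free as 'invalid'."""
    return dispatch_id + 1


class TokenSemaphore:
    """Hypothetical stand-in for the int64 semaphore array in global memory."""

    def __init__(self, num_slots):
        self._slots = [INVALID_TOKEN] * num_slots
        self._cond = threading.Condition()

    def publish(self, slot, token):
        # Stand-in for a release store of the token.
        with self._cond:
            self._slots[slot] = token
            self._cond.notify_all()

    def wait(self, slot, token, timeout=5.0):
        # Stand-in for an acquire poll; returns False if the token never shows up.
        with self._cond:
            return self._cond.wait_for(lambda: self._slots[slot] == token, timeout)


def run_dispatches(num_dispatches):
    sem = TokenSemaphore(num_slots=1)
    partial = [0]  # stand-in for the split-k partial-result buffer
    results = []
    for dispatch_id in range(num_dispatches):
        token = dispatch_token(dispatch_id)

        def producer(token=token, dispatch_id=dispatch_id):
            partial[0] = dispatch_id * 10  # write the data first ...
            sem.publish(0, token)          # ... then publish the token

        worker = threading.Thread(target=producer)
        worker.start()
        assert sem.wait(0, token)          # consumer blocks until this dispatch's token
        results.append(partial[0])
        worker.join()
    return results
```

Note that the slot is never cleared between iterations: the value left behind by dispatch `n` simply fails the equality check for dispatch `n + 1`, which is the property that matters when a captured graph is replayed.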

Changes:

  • Replace rotating split-k “signal state” + int32 counters with an int64 per-dispatch token (dispatch_id + 1, reserving 0 as invalid) and release/acquire publish+poll.
  • Remove host-managed split-k state rotation and shrink the global semaphore to a single int64 counter array.
  • Reduce barrier polling overhead by having only tid==0 poll the semaphore and then synchronizing the block with gpu.barrier().
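The single-poller pattern from the last bullet can be illustrated with a small CPU simulation. This is a sketch under stated assumptions, not the kernel: Python threads play the role of the threads in one block, `threading.Barrier` stands in for `gpu.barrier()`, and `Flag` and `run_block` are hypothetical names introduced only for this example.

```python
import threading
import time


class Flag:
    """Hypothetical stand-in for one semaphore slot in global memory."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def store(self, value):
        with self._lock:
            self._value = value

    def load(self):
        with self._lock:
            return self._value


def run_block(token, num_threads=8):
    flag = Flag()
    data = [0]                              # payload guarded by the flag
    barrier = threading.Barrier(num_threads)
    polls = [0] * num_threads               # flag reads per simulated thread
    observed = [None] * num_threads

    def producer():
        time.sleep(0.01)
        data[0] = 42                        # write the payload first ...
        flag.store(token)                   # ... then publish the token

    def body(tid):
        if tid == 0:
            # Only thread 0 spins on the flag.
            while True:
                polls[0] += 1
                if flag.load() == token:
                    break
                time.sleep(0.001)
        barrier.wait()                      # stand-in for gpu.barrier()
        observed[tid] = data[0]             # every thread may now read the payload

    threads = [threading.Thread(target=body, args=(tid,)) for tid in range(num_threads)]
    threads.append(threading.Thread(target=producer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return observed, polls
```

Calling `run_block(token=1)` returns the payload seen by each simulated thread and the per-thread poll counts; only index 0 of the poll counts is nonzero, which mirrors the reduction in semaphore traffic the bullet describes.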

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| aiter/ops/flydsl/kernels/splitk_hgemm.py | Switch split-k semaphore publish/poll to int64 dispatch tokens; remove stale-counter cleanup and signal_state plumbing. |
| aiter/ops/flydsl/kernels/small_m_hgemm.py | Mirror the int64 dispatch-token protocol for small-M split-k; remove stale-counter cleanup and signal_state. |
| aiter/ops/flydsl/gemm_kernels.py | Allocate a per-stream int64 semaphore (length 128) and remove host-side rotating split-k state management. |
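As a rough picture of the per-stream allocation in `gemm_kernels.py`, the sketch below caches one zero-initialized int64 buffer of length 128 per stream. It is illustrative only: `get_splitk_semaphore` and the integer `stream_id` key are hypothetical, and a stdlib `array` stands in for the device tensor the real code would allocate.

```python
from array import array

SEMAPHORE_LEN = 128  # length quoted in the review summary

# Hypothetical cache: one buffer per stream, created on first use.
_semaphores = {}


def get_splitk_semaphore(stream_id):
    """Return the int64 semaphore buffer for a stream, allocating it lazily."""
    buf = _semaphores.get(stream_id)
    if buf is None:
        # 'q' is a signed 64-bit integer; zero matches the reserved invalid token.
        buf = array("q", [0] * SEMAPHORE_LEN)
        _semaphores[stream_id] = buf
    return buf
```

Because the buffer is looked up rather than rotated, the host keeps no per-launch state: repeated calls for the same stream return the same object, and a different stream gets its own buffer.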


Comment thread aiter/ops/flydsl/kernels/splitk_hgemm.py Outdated
Replace the rotating split-k signal state with a 64-bit dispatch token barrier so split-k launches stay graph-safe and avoid stale-counter races. Narrow the polling and small_m publish waits to vmcnt-only to recover the lost performance while keeping the new protocol stable.

Made-with: Cursor
@XiaobingSuper XiaobingSuper force-pushed the xiaobing/hgemm_split_fix branch from bf1da7b to 9bfdc55 Compare April 25, 2026 13:05
@XiaobingSuper XiaobingSuper changed the title from "Fix FlyDSL split-k barrier synchronization without regressing performance" to "Fix FlyDSL split-k barrier synchronization" Apr 25, 2026
@XiaobingSuper XiaobingSuper changed the title from "Fix FlyDSL split-k barrier synchronization" to "Fix FlyDSL split-k barrier synchronization issue" Apr 25, 2026