Refine hgemm split-k barrier #435

Open
xytpai wants to merge 7 commits into main from xyt/hgemm_fix

xytpai commented Apr 24, 2026

Motivation

  1. Fix a memory issue during CUDA graph capture (use a device-side signal instead of a host-side one).
  2. Add bias support.
  3. Remove the preshuffle path.
  4. Optimize the b_to_lds path for MI355.

About split-k barrier

We adopt the following principles:

  • The first block launched within a split-k group zeroes c and triggers the signal.
  • All blocks spin-wait at the barrier until the signal is triggered.
  • The last block to arrive at the split-k barrier within the group clears the semaphore and the signal.
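
The three rules above can be sketched on the host with Python threads standing in for the blocks of one split-k group. This is only an illustration under stated assumptions, not the kernel code in this PR: `SplitKGroup`, `block`, and `run_group` are hypothetical names, a lock emulates device atomics, and the way the "last block" is detected (a second increment of the same semaphore after the wait, so the reset cannot race with blocks that are still spinning) is one possible choice that the description does not spell out.

```python
import threading
import time


class SplitKGroup:
    """Shared state of one split-k group (stand-in for device memory)."""

    def __init__(self, n_out):
        self.lock = threading.Lock()      # emulates atomicity of device atomics
        self.semaphore = 0                # counts arrivals and departures
        self.signal = 0                   # set to 1 once c has been zeroed
        self.c = [float("nan")] * n_out   # stale output; must be zeroed first

    def atomic_inc_semaphore(self):
        """Atomically increment the semaphore and return its old value."""
        with self.lock:
            old = self.semaphore
            self.semaphore += 1
            return old


def block(group, split_k, partial):
    """Work done by one block: barrier, then accumulate its partial result."""
    # Rule 1: the first block to arrive zeroes c and triggers the signal.
    if group.atomic_inc_semaphore() == 0:
        for i in range(len(group.c)):
            group.c[i] = 0.0
        group.signal = 1

    # Rule 2: every block spin-waits until the signal is triggered.
    while group.signal == 0:
        time.sleep(0)  # yield to other threads while spinning

    # Accumulate this block's partial sum into c (atomic add on a device).
    with group.lock:
        for i, value in enumerate(partial):
            group.c[i] += value

    # Rule 3 (assumed detection scheme): each block increments the semaphore
    # a second time after the wait. The block that performs increment number
    # 2 * split_k is the last one through, so it can safely reset the state.
    if group.atomic_inc_semaphore() == 2 * split_k - 1:
        group.semaphore = 0
        group.signal = 0


def run_group(group, partials):
    """Launch one thread per split-k block and return the accumulated c."""
    split_k = len(partials)
    threads = [
        threading.Thread(target=block, args=(group, split_k, p))
        for p in partials
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(group.c)
```

Because the last block leaves the semaphore and the signal at zero, calling `run_group` again on the same `SplitKGroup` needs no host-side reset between launches, which is the property that matters when launches are replayed from a captured graph.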

Some results on model shapes

| m | n | k | hipblas (us) | flydsl (us) | speedup |
|---|---|---|---|---|---|
| 32 | 384 | 7268 | 11 | 5 | 2.184 |
| 4 | 384 | 7168 | 13.7 | 6 | 2.277 |
| 8 | 5120 | 2880 | 13.9 | 10.7 | 1.292 |

xytpai requested a review from coderfeli April 24, 2026 09:59
coderfeli previously approved these changes Apr 24, 2026
xytpai requested a review from coderfeli April 25, 2026 00:06
xytpai changed the title from "Refine hgemm split-k kernel" to "Refine hgemm split-k barrier" Apr 25, 2026

2 participants