GEMM + ReduceScatter with Workgroup Specialization Example by knwng · Pull Request #317 · ROCm/iris

knwng · 2026-01-13T17:59:54Z

Motivation

Add an example of GEMM + ReduceScatter using workgroup specialization. Resolves #178.

Technical Details

It's a one-shot GEMM + ReduceScatter kernel that uses atomic_add to perform the reduction in place.
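
For readers unfamiliar with the pattern, here is a rough pure-Python sketch of what "one-shot with in-place atomic adds" could look like. It assumes each rank holds a K-shard of the inputs and that the output is scattered by rows; neither detail is stated in this PR. The function names are hypothetical, and the plain `+=` only stands in for the kernel's `atomic_add` into a remote buffer.

```python
def matmul(a, b):
    """Naive dense matmul on nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gemm_reduce_scatter_one_shot(a_shards, b_shards):
    """Simulate a one-shot GEMM + ReduceScatter across len(a_shards) ranks.

    Assumption: rank r holds a_shards[r] (M x K_r) and b_shards[r] (K_r x N),
    and ends up owning rows [r * M / W, (r + 1) * M / W) of the summed product.
    """
    world_size = len(a_shards)
    m = len(a_shards[0])
    n = len(b_shards[0][0])
    assert m % world_size == 0, "M must divide evenly across ranks"
    rows_per_rank = m // world_size

    # Each rank owns one zero-initialized output slice.
    outputs = [[[0] * n for _ in range(rows_per_rank)] for _ in range(world_size)]

    for rank in range(world_size):
        # Local GEMM produces a full-size partial result on this rank.
        partial = matmul(a_shards[rank], b_shards[rank])
        for row_idx, row in enumerate(partial):
            dest = row_idx // rows_per_rank      # rank that owns this row
            local_row = row_idx % rows_per_rank  # row index inside that slice
            for col_idx, value in enumerate(row):
                # Stand-in for atomic_add: accumulate directly into the
                # owner's buffer, so no separate reduction pass is needed.
                outputs[dest][local_row][col_idx] += value
    return outputs
```

Because every partial value is added straight into its final location, the reduction and the scatter happen in a single pass, which is the sense in which the sketch is "one-shot".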

Test Plan

As discussed, it's been tested locally.

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/gemm_reduce_scatter.py

mawad-amd

Thanks for the PR, Kyle! I know it is a draft but I left a couple of comments.

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/gemm_reduce_scatter.py

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/benchmark.py

knwng · 2026-01-20T16:37:16Z

Hi @mawad-amd, as you mentioned in #169, do I need to add a test for this, like https://github.com/ROCm/iris/blob/main/tests/examples/test_all_load_bench.py?

Copilot

Pull request overview

This PR introduces a GEMM + ReduceScatter example that uses workgroup specialization to overlap computation and communication on AMD GPUs. The implementation divides SMs into GEMM workgroups for matrix multiplication and communication workgroups for scatter operations.
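
To make the workgroup split concrete, here is a minimal pure-Python sketch of how program IDs might be partitioned into the two roles and how each persistent workgroup could stride over tiles. The names `assign_roles` and `tiles_for`, the rule that the lowest IDs take the GEMM role, and the strided schedule are all assumptions for illustration. The kernel in this PR is not reproduced here, and the synchronization between producer and consumer workgroups is not modeled.

```python
def assign_roles(total_workgroups, gemm_workgroups):
    """Map each program ID to a role: 'gemm' (compute) or 'comm' (scatter)."""
    if not 0 < gemm_workgroups < total_workgroups:
        raise ValueError("need at least one workgroup in each role")
    # Assumption: the first `gemm_workgroups` IDs compute, the rest communicate.
    return {
        pid: "gemm" if pid < gemm_workgroups else "comm"
        for pid in range(total_workgroups)
    }


def tiles_for(pid, roles, total_tiles):
    """Tiles a persistent workgroup visits, striding by its role's group size."""
    role = roles[pid]
    peers = sorted(p for p, r in roles.items() if r == role)
    start = peers.index(pid)  # position of this workgroup within its role
    return list(range(start, total_tiles, len(peers)))


if __name__ == "__main__":
    roles = assign_roles(total_workgroups=6, gemm_workgroups=4)
    for pid in sorted(roles):
        print(pid, roles[pid], tiles_for(pid, roles, total_tiles=10))
```

In this sketch both roles walk the same tile index space, so a communication workgroup eventually handles every tile that some GEMM workgroup produced. A real kernel would also need a per-tile ready signal so that a tile is not sent before it has been computed.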

Changes:

- Added validation function for reduce-scatter operations
- Implemented persistent GEMM kernel with integrated ReduceScatter using workgroup specialization
- Created benchmark infrastructure with timing, validation, and tracing capabilities

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| examples/common/validation.py | Added `validate_reduce_scatter` function to verify reduce-scatter correctness |
| examples/22_gemm_one_shot_reduce_scatter_wg_specialization/gemm_reduce_scatter.py | Core kernel implementing GEMM + ReduceScatter with SM specialization |
| examples/22_gemm_one_shot_reduce_scatter_wg_specialization/matmul_wrapper.py | PyTorch autograd wrapper for the GEMM kernel |
| examples/22_gemm_one_shot_reduce_scatter_wg_specialization/benchmark.py | Benchmark script with validation, timing, and distributed setup |
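
As an illustration of what a reduce-scatter correctness check might involve, the sketch below compares one rank's output slice against the element-wise sum of every rank's partial result. It is not the `validate_reduce_scatter` function added in this PR, whose signature is not shown here. The names `expected_slice` and `check_reduce_scatter`, the row-wise scatter, and the absolute tolerance are assumptions.

```python
import math


def expected_slice(partials, rank):
    """Reference result for `rank`: its row block of the sum of all partials."""
    world_size = len(partials)
    m = len(partials[0])
    n = len(partials[0][0])
    rows_per_rank = m // world_size
    start = rank * rows_per_rank
    return [
        [sum(p[r][c] for p in partials) for c in range(n)]
        for r in range(start, start + rows_per_rank)
    ]


def check_reduce_scatter(local_output, partials, rank, atol=1e-6):
    """Return True if local_output matches the reference within `atol`."""
    reference = expected_slice(partials, rank)
    if len(local_output) != len(reference):
        return False
    for got_row, ref_row in zip(local_output, reference):
        if len(got_row) != len(ref_row):
            return False
        for got, ref in zip(got_row, ref_row):
            # Absolute tolerance only; rel_tol=0.0 disables the relative check.
            if not math.isclose(got, ref, rel_tol=0.0, abs_tol=atol):
                return False
    return True
```

A tolerance-based comparison matters here because atomic accumulation does not fix the order of floating-point additions, so results can differ slightly between runs.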

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/benchmark.py

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/matmul_wrapper.py

mawad-amd

Looks good. Thanks, Kyle!

knwng and others added 3 commits January 12, 2026 01:26

add GEMM+ReduceScatter w/ workgroup specialization

0741c51

Apply Ruff auto-fixes

6da76b4

cleanup

b468024

mawad-amd reviewed Jan 14, 2026

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/gemm_reduce_scatter.py (outdated, resolved)

mawad-amd reviewed Jan 14, 2026

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/gemm_reduce_scatter.py (outdated, resolved)

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/benchmark.py (outdated, resolved)

knwng and others added 3 commits January 14, 2026 16:32

address comment

83ca440

clean up

df1bd9d

Apply Ruff auto-fixes

30d02d1

knwng marked this pull request as ready for review January 20, 2026 17:59

knwng requested review from BKP and neoblizz as code owners January 20, 2026 17:59

Copilot AI review requested due to automatic review settings January 20, 2026 17:59

Copilot AI reviewed Jan 20, 2026

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/benchmark.py (outdated, resolved)

examples/22_gemm_one_shot_reduce_scatter_wg_specialization/matmul_wrapper.py (outdated, resolved)

knwng requested a review from mawad-amd January 20, 2026 18:04

address comments

a5d4203

mawad-amd approved these changes Jan 20, 2026

mawad-amd merged commit 9352d1a into ROCm:main Jan 20, 2026
16 of 18 checks passed

mawad-amd merged 7 commits into ROCm:main from knwng:gemm_rs_fused_pc
