
rmsnorm gluon kernel created for gfx1250 #2912

Open

amd-jrosas wants to merge 6 commits into main from jrosas_gluon_rmsnorm

Conversation

@amd-jrosas

Motivation

Create an RMSNorm kernel in Gluon for gfx1250.

Technical Details

Translated the existing Triton implementation into a Gluon equivalent.
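
For context, the kernel computes row-wise RMS normalization. A minimal eager PyTorch sketch of the intended semantics (illustrative only; rmsnorm_ref and its signature are not the kernel or the test code):

# Each row is scaled by the reciprocal RMS of its elements; the reciprocal
# (rsigma) is also returned, mirroring the rsigma output in the diff below.
import torch

def rmsnorm_ref(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    rsigma = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * rsigma * weight, rsigma.squeeze(-1)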

Test Plan

Added a test case for the Gluon implementation to the existing test_rmsnorm.py.

Test Result

Passed all test conditions.

@amd-jrosas amd-jrosas requested a review from a team April 24, 2026 14:14
@github-actions
Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

  • ci:triton-300x: Run an additional Triton test job on MI300X in PRs; the main branch always runs both MI35X and MI300X
  • ci:sglang: SGLang integration tests
  • ci:atom: ATOM benchmark (DeepSeek-R1 + GPT-OSS)
  • ci:vllm: vLLM benchmark
  • ci:all: All of the above

Add labels via the sidebar or gh pr edit 2912 --add-label <label>

amd-jrosas and others added 5 commits April 24, 2026 11:10
sharedLayoutWeights: gl.constexpr = gl.SwizzledSharedLayout(1, 1, 1, order=[0])

# create a swizzled shared layout for the output
gl.SwizzledSharedLayout(1, 1, 1, order=[1, 0])

This isn't assigned to anything, and it's probably also not needed: the output isn't TDM stored, so you don't need a shared layout for it?
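
A sketch of what that suggestion amounts to, assuming the weights are the only tensor staged through shared memory and the output is written with an ordinary gl.store:

# Keep a shared layout only for operands that actually go through shared memory.
sharedLayoutWeights: gl.constexpr = gl.SwizzledSharedLayout(1, 1, 1, order=[0])
# (removed) gl.SwizzledSharedLayout(1, 1, 1, order=[1, 0])  # output has no TDM store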


# Loop through the rows of the input tensor by NUM_PROG blocks
for row_idx in range(row_start, n_rows, NUM_PROG):
    input_ptr + (row_idx * input_row_stride)

This isn't assigned to anything?
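
Presumably the address was meant to be bound to a name and used by the loads that follow; a hedged sketch (row_input_ptr is a hypothetical name, not from the diff):

for row_idx in range(row_start, n_rows, NUM_PROG):
    # bind the row base address so the per-row loads actually use it
    row_input_ptr = input_ptr + (row_idx * input_row_stride)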

rms_norm = a * norm_factor * weights
# store rms norm and the norm factor
gl.store(
    rsigma_ptr + row_start, norm_factor.to(rsigma_ptr.dtype.element_ty)

Did you mean row_idx?
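
If so, the store would index by the loop variable rather than the program's starting row:

# store one rsigma per processed row instead of repeatedly overwriting row_start
gl.store(
    rsigma_ptr + row_idx, norm_factor.to(rsigma_ptr.dtype.element_ty)
)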

USE_BLOCK = COL > BLOCK_SIZE
NUM_PROG = min(ROW, get_num_sms())

grid = (NUM_PROG,)

I think you can put min(ROW, get_num_sms()) here.
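
That is, fold the expression into the grid tuple if NUM_PROG is not referenced anywhere else (it can stay as a separate variable if the kernel still needs it as a constexpr argument):

# inline the launch width directly into the grid
grid = (min(ROW, get_num_sms()),)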

output = torch.empty_like(input, device=input.device)
rsigma = torch.empty((ROW,), device=input.device, dtype=input.dtype)

MAX_FUSED_SIZE = 65536 // input.element_size()

Comment for the magic number?
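
Something along these lines, assuming 65536 is meant as a 64 KiB cap on the fused row block (the exact rationale should come from the author):

# Cap the per-row block at 64 KiB of data so BLOCK_SIZE stays within the
# assumed on-chip memory budget regardless of the input element size.
MAX_FUSED_SIZE = 65536 // input.element_size()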

@@ -0,0 +1,39 @@
# SPDX-License-Identifier: MIT

I don't think we want two different files. We want a single API and the wrapper decides whether to call triton (gfx950 and earlier) or gluon (gfx1250, if a gluon kernel exists).
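
A rough sketch of the suggested dispatch (function and helper names are illustrative, not the actual aiter API):

import torch

def _is_gfx1250() -> bool:
    # Assumption: ROCm builds of PyTorch expose the arch string as gcnArchName.
    return "gfx1250" in torch.cuda.get_device_properties(0).gcnArchName

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Single public entry point; pick the backend per architecture.
    if _is_gfx1250():
        return rms_norm_gluon(x, weight, eps)   # Gluon kernel from this PR
    return rms_norm_triton(x, weight, eps)      # existing Triton kernel (gfx950 and earlier)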

