Flydsl rmsnorm by kudomcho · Pull Request #2889 · ROCm/aiter

kudomcho · 2026-04-23T23:02:38Z

Motivation

This PR improves the performance of the FlyDSL RMSNorm kernel by addressing inefficiencies observed in production-like workloads (e.g., GPT-style shapes such as N=2880).

The previous implementation suffered from:

heavy reliance on scalar operations in non-aligned cases
high per-row overhead in generic paths
inconsistent performance across different shapes

The goal of this PR is to:

eliminate scalar bottlenecks
maximize vectorized execution
improve performance consistency across both aligned and non-aligned workloads

Technical Details

This PR introduces several optimizations to the RMSNorm kernel:

Dual execution paths

Tile-fast path (aligned shapes): fully vectorized using buffer_load/store
Vector-generic path (arbitrary shapes): vectorized bulk + minimal scalar tail

This ensures most workloads avoid expensive scalar execution.
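As a rough illustration of the dual-path dispatch, the sketch below picks a path from the row width. The alignment rule (row width divisible by one tile of BLOCK_THREADS * VEC_WIDTH elements), the BLOCK_THREADS value, and the name select_path are assumptions made for this example; they are not taken from the kernel source.

```python
VEC_WIDTH = 8        # from the PR description
BLOCK_THREADS = 256  # assumption: threads per block


def select_path(n):
    """Return the path a row of width n would take under the assumed rule."""
    tile = BLOCK_THREADS * VEC_WIDTH  # elements covered by one fully vectorized pass
    return "tile_fast" if n % tile == 0 else "vector_generic"


for n in (2048, 2880, 4096):
    print(n, select_path(n))
```

Under this assumed rule, N=2880 would fall into the vector-generic path even though it is a multiple of VEC_WIDTH.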

Vectorized memory access

Standardized on VEC_WIDTH = 8
Improved memory coalescing and throughput using vectorized buffer ops
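To show why a fixed vector width helps coalescing, here is a minimal sketch of one possible thread-to-vector mapping, where neighbouring threads read neighbouring 8-element vectors. BLOCK_THREADS and the helper vector_offsets are hypothetical; the real kernel's indexing may differ.

```python
VEC_WIDTH = 8        # from the PR description
BLOCK_THREADS = 256  # assumption: threads per block


def vector_offsets(tid, n):
    """Element offsets of the full vectors that thread `tid` loads from a row of width n."""
    stride = BLOCK_THREADS * VEC_WIDTH
    # Only full vectors are listed; any leftover elements belong to the scalar tail.
    return list(range(tid * VEC_WIDTH, n - VEC_WIDTH + 1, stride))


print(vector_offsets(0, 2880))
print(vector_offsets(1, 2880))
```

Thread 0 and thread 1 start 8 elements apart, so a group of adjacent threads touches one contiguous span of memory per step.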

Reduced scalar overhead

Scalar operations are now restricted to the final tail only
Eliminates full-row scalar fallback in generic cases
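The bulk-plus-tail split can be sketched in plain Python as below. This is only a host-side model of the idea: each VEC_WIDTH-wide slice stands in for one vector load, and the function name is made up for the example.

```python
VEC_WIDTH = 8  # from the PR description


def sum_of_squares(row):
    """Sum of squares using a vectorized bulk and a scalar tail."""
    n = len(row)
    bulk_end = (n // VEC_WIDTH) * VEC_WIDTH
    total = 0.0
    # Bulk: one VEC_WIDTH-wide chunk per step (stands in for a vector load).
    for start in range(0, bulk_end, VEC_WIDTH):
        chunk = row[start:start + VEC_WIDTH]
        total += sum(v * v for v in chunk)
    # Tail: at most VEC_WIDTH - 1 elements, handled one at a time.
    for v in row[bulk_end:]:
        total += v * v
    return total
```

For a row of 19 elements, 16 go through the bulk loop and only 3 reach the scalar tail.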

Block reduction simplification

Replaced multi-buffer reduction with a single shared-memory reduction
Reduced synchronization and shared memory traffic
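The PR text does not spell out the reduction internals, so the following is just one common structure for a single-buffer block reduction, modelled on the host. WAVE_SIZE, BLOCK_THREADS, and block_reduce_sum are assumptions for illustration.

```python
WAVE_SIZE = 64       # assumption: lanes per wavefront
BLOCK_THREADS = 256  # assumption: threads per block


def block_reduce_sum(per_thread):
    """Reduce one partial sum per thread through a single shared buffer."""
    assert len(per_thread) == BLOCK_THREADS
    num_waves = BLOCK_THREADS // WAVE_SIZE
    shared = [0.0] * num_waves  # the only shared-memory buffer in this model
    # Step 1: each wave reduces its own lanes (stands in for cross-lane shuffles).
    for w in range(num_waves):
        lanes = per_thread[w * WAVE_SIZE:(w + 1) * WAVE_SIZE]
        shared[w] = sum(lanes)
    # Step 2: after one barrier, the per-wave partials are combined.
    return sum(shared)
```

In this model only one value per wave is written to shared memory, and a single synchronization point separates the two steps.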

Loop unrolling

Added UNROLL = 2 for medium-width bf16/f16 workloads (N <= 4096)
Improves instruction-level parallelism and reduces loop overhead
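A minimal sketch of what unrolling by 2 looks like structurally is shown below. Python itself gains nothing from this; the point is the two independent accumulators, which in a compiled kernel can be issued without waiting on each other. The function name is hypothetical, and in the kernel each element would presumably be a whole vector.

```python
UNROLL = 2  # from the PR description


def sum_of_squares_unrolled(values):
    """Sum of squares with the loop body unrolled by UNROLL = 2."""
    acc0 = 0.0
    acc1 = 0.0
    n = len(values)
    main_end = n - (n % UNROLL)
    # Two independent accumulators per iteration: neither depends on the other.
    for i in range(0, main_end, UNROLL):
        acc0 += values[i] * values[i]
        acc1 += values[i + 1] * values[i + 1]
    # Remainder iteration when n is odd.
    for i in range(main_end, n):
        acc0 += values[i] * values[i]
    return acc0 + acc1
```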

Bounded input caching (generic path)

Cached x values during reduction pass to avoid reloading in normalization pass
Enabled only for medium-width cases to control register pressure
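The bounded caching idea can be modelled as a two-pass loop that keeps the loaded values only when the row is narrow enough. The threshold CACHE_MAX_N, the eps default, and the load callback are assumptions for this sketch; the PR only says caching is limited to medium-width cases.

```python
import math

CACHE_MAX_N = 4096  # assumption: widest row that is cached


def rmsnorm_row(load, n, weight, eps=1e-6):
    """Two-pass RMSNorm over one row; `load(i)` stands in for a global-memory read."""
    use_cache = n <= CACHE_MAX_N
    cache = []
    sq_sum = 0.0
    # Pass 1: reduction, optionally keeping the loaded values.
    for i in range(n):
        x = load(i)
        if use_cache:
            cache.append(x)
        sq_sum += x * x
    inv_rms = 1.0 / math.sqrt(sq_sum / n + eps)
    # Pass 2: normalization, reusing cached values when available.
    out = []
    for i in range(n):
        x = cache[i] if use_cache else load(i)
        out.append(x * inv_rms * weight[i])
    return out
```

With caching enabled, each element is loaded once instead of twice; for wider rows the sketch falls back to reloading, which mirrors the trade-off against register pressure.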

Register pressure optimization

Removed input caching from fast tiled path
Avoids excessive register usage and potential spills

Test Plan

The unit test against the PyTorch reference and the benchmark against AITER are both provided in test_rmsnorm_bench_against_aiter.py. Run it from the flydsl directory with python test_rmsnorm_bench_against_aiter.py.
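For readers unfamiliar with the check, here is a small pure-Python sketch of a reference RMSNorm and of metrics similar to the max_delta and close fields in the summary. It is not the test script itself; the tolerance atol and the helper names are assumptions.

```python
import math


def rmsnorm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def compare(out, ref, atol=2e-2):
    """Return (max absolute difference, percent of elements within atol)."""
    deltas = [abs(a - b) for a, b in zip(out, ref)]
    close = 100.0 * sum(d <= atol for d in deltas) / len(deltas)
    return max(deltas), close


ref = rmsnorm_ref([3.0, 4.0], [1.0, 1.0])
print(compare(ref, ref))
```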

Test Result

All cases using the GPT-OSS 120B RMSNorm dimensions pass. The largest gain is at dimension 16384, with a 40% improvement; the remaining cases improve by 1-2 us on average.

======================================================================
SUMMARY

PASS flydsl_rmsnorm_M3000_N2880_torch.bfloat16 max_delta= 0.015321 close=100.00% AITER= 12.97us FlyDSL= 29.11us speedup= 0.446x
PASS aiter_rmsnorm_M3000_N2880_torch.bfloat16 max_delta= 0.015321 close=100.00% AITER= 12.97us FlyDSL= 29.11us speedup= 0.446x
PASS flydsl_rmsnorm_M4000_N2880_torch.bfloat16 max_delta= 0.015002 close=100.00% AITER= 12.96us FlyDSL= 27.71us speedup= 0.468x
PASS aiter_rmsnorm_M4000_N2880_torch.bfloat16 max_delta= 0.015002 close=100.00% AITER= 12.96us FlyDSL= 27.71us speedup= 0.468x
PASS flydsl_rmsnorm_M5000_N2880_torch.bfloat16 max_delta= 0.015440 close=100.00% AITER= 15.96us FlyDSL= 27.96us speedup= 0.571x
PASS aiter_rmsnorm_M5000_N2880_torch.bfloat16 max_delta= 0.015440 close=100.00% AITER= 15.96us FlyDSL= 27.96us speedup= 0.571x
PASS flydsl_rmsnorm_M7000_N2880_torch.bfloat16 max_delta= 0.015307 close=100.00% AITER= 14.59us FlyDSL= 27.93us speedup= 0.522x
PASS aiter_rmsnorm_M7000_N2880_torch.bfloat16 max_delta= 0.015307 close=100.00% AITER= 14.59us FlyDSL= 27.93us speedup= 0.522x
PASS flydsl_rmsnorm_M3072_N2880_torch.bfloat16 max_delta= 0.015321 close=100.00% AITER= 12.35us FlyDSL= 27.66us speedup= 0.447x
PASS aiter_rmsnorm_M3072_N2880_torch.bfloat16 max_delta= 0.015321 close=100.00% AITER= 12.35us FlyDSL= 27.66us speedup= 0.447x
PASS flydsl_rmsnorm_M4096_N2880_torch.bfloat16 max_delta= 0.015002 close=100.00% AITER= 12.75us FlyDSL= 27.32us speedup= 0.467x
PASS aiter_rmsnorm_M4096_N2880_torch.bfloat16 max_delta= 0.015002 close=100.00% AITER= 12.75us FlyDSL= 27.32us speedup= 0.467x
PASS flydsl_rmsnorm_M7168_N2880_torch.bfloat16 max_delta= 0.015307 close=100.00% AITER= 15.02us FlyDSL= 27.90us speedup= 0.538x
PASS aiter_rmsnorm_M7168_N2880_torch.bfloat16 max_delta= 0.015307 close=100.00% AITER= 15.02us FlyDSL= 27.90us speedup= 0.538x
PASS flydsl_rmsnorm_M8192_N2880_torch.bfloat16 max_delta= 0.015525 close=100.00% AITER= 16.87us FlyDSL= 27.81us speedup= 0.606x
PASS aiter_rmsnorm_M8192_N2880_torch.bfloat16 max_delta= 0.015525 close=100.00% AITER= 16.87us FlyDSL= 27.81us speedup= 0.606x
PASS flydsl_rmsnorm_M16384_N2880_torch.bfloat16 max_delta= 0.015553 close=100.00% AITER= 30.41us FlyDSL= 29.02us speedup= 1.048x
PASS aiter_rmsnorm_M16384_N2880_torch.bfloat16 max_delta= 0.015553 close=100.00% AITER= 30.41us FlyDSL= 29.02us speedup= 1.048x

18/18 passed
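The speedup column appears to be the AITER time divided by the FlyDSL time, so a value below 1 means FlyDSL was slower than AITER for that shape. The check below recomputes the ratio from the rounded timings in the summary under that assumption; small differences in the last digit are expected because the printed times are rounded.

```python
# (M, AITER us, FlyDSL us, reported speedup) copied from the summary above.
rows = [
    (3000, 12.97, 29.11, 0.446),
    (4000, 12.96, 27.71, 0.468),
    (5000, 15.96, 27.96, 0.571),
    (7000, 14.59, 27.93, 0.522),
    (3072, 12.35, 27.66, 0.447),
    (4096, 12.75, 27.32, 0.467),
    (7168, 15.02, 27.90, 0.538),
    (8192, 16.87, 27.81, 0.606),
    (16384, 30.41, 29.02, 1.048),
]

for m, aiter_us, flydsl_us, reported in rows:
    ratio = aiter_us / flydsl_us
    print(f"M={m:5d}  AITER/FlyDSL = {ratio:.3f}  (reported {reported:.3f})")
```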

github-actions · 2026-04-23T23:03:14Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

Label	Tests
`ci:sglang`	SGLang integration tests
`ci:atom`	ATOM benchmark (DeepSeek-R1 + GPT-OSS)
`ci:vllm`	vLLM benchmark
`ci:all`	All of the above

Add labels via the sidebar or gh pr edit 2889 --add-label <label>

zhaoh27 and others added 15 commits March 31, 2026 14:56

added rmsnorm unit test and kernel

4eba263

reformatted using black

04aeb34

reformatted using black

63465e5

implemented torch_compile_guard on rmsnorm

dff00e5

Delete aiter/ops/flydsl/rmsnorm.py

bf45871

moved test_rmsnorm to flydsl dir

fd20ed4

Delete aiter/ops/test_rmsnorm.py

5f87047

added rmsnorm kernel

4093a70

added rmsnorm on __init__

8269fc8

renamed rmsnorm kernel file

6f2eaee

added test benchmark for aiter vs flydsl rmsnorm and added kdtype check func on kernels_common.py

c73327a

optimized fast path tiling and reduced register use, and vectorized ops on normal path

8b5786c

stop tracking file

6c2769c

rename unit test and benchmark script

3df753e

reformatted code via black

bfeb584

kudomcho requested a review from a team April 23, 2026 23:02

kudomcho and others added 12 commits April 23, 2026 18:04

Merge branch 'main' into flydsl_rmsnorm

a69d899

Update aiter/ops/flydsl/__init__.py

38c1d42

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

moreve duplicate get_warP-size function kernels_common.py

8a2f84f

move modeult to top

23c73a8

fixed black format

c2aa1e8

fixed black format

1bb5024

fixed fp32 to pass unit test for dtype fp32

6ea88df

fixed black format

bbf66a0

removed hardcoded warp size

64bfe97

Merge branch 'main' into flydsl_rmsnorm

3b7b6d1

maintain hardcoded warp size as 256

7bdd539

maintain hardcoded warp size as 256

c6f357c

kudomcho wants to merge 27 commits into main from flydsl_rmsnorm
