optimized fast path by reducing register use and vectorize on normal … by kudomcho · Pull Request #436 · ROCm/FlyDSL

kudomcho · 2026-04-24T19:39:24Z

Motivation

This PR improves the performance of the FlyDSL RMSNorm kernel by addressing inefficiencies observed in production-like workloads (e.g., GPT-style shapes such as N=2880).

The previous implementation suffered from:

heavy reliance on scalar operations in non-aligned cases
high per-row overhead in generic paths
inconsistent performance across different shapes

The goal of this PR is to:

eliminate scalar bottlenecks
maximize vectorized execution
improve performance consistency across both aligned and non-aligned workloads

Technical Details

This PR introduces several optimizations to the RMSNorm kernel:

Dual execution paths

Tile-fast path (aligned shapes): fully vectorized using buffer_load/store
Vector-generic path (arbitrary shapes): vectorized bulk plus a minimal scalar tail
This ensures most workloads avoid expensive scalar execution.
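
As a rough illustration of how the two paths might be selected, here is a host-side sketch. The function name `select_path`, the constant `BLOCK_THREADS`, and the alignment rule itself (row width divisible by one full tile of `BLOCK_THREADS * VEC_WIDTH` elements) are assumptions made for this example; the PR description does not state the exact dispatch condition.

```python
VEC_WIDTH = 8        # elements per vectorized access (from the PR description)
BLOCK_THREADS = 256  # assumed threads per block; hypothetical for this sketch


def select_path(n: int) -> str:
    """Pick an execution path for a row of width n (assumed rule, not the PR's)."""
    tile = BLOCK_THREADS * VEC_WIDTH  # elements covered by one fully vectorized tile
    if n % tile == 0:
        return "tile_fast"       # aligned: every access is a full vector
    return "vector_generic"      # arbitrary: vectorized bulk plus scalar tail
```

Under this assumed rule, N=8192 would take the tile-fast path and N=2880 the vector-generic path.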

Vectorized memory access

Standardized on VEC_WIDTH = 8
Improved memory coalescing and throughput using vectorized buffer ops

Reduced scalar overhead

Scalar operations are now restricted to the final tail only
Eliminates full-row scalar fallback in generic cases
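
The bulk-plus-tail split can be sketched in plain Python. This only models the access pattern on the host; `split_row` and `sum_of_squares` are hypothetical helper names, not functions from the FlyDSL kernel.

```python
VEC_WIDTH = 8  # vector width named in the PR description


def split_row(n: int) -> tuple:
    """Return (number of full vectors, number of scalar tail elements)."""
    return n // VEC_WIDTH, n % VEC_WIDTH


def sum_of_squares(row: list) -> float:
    """Accumulate x*x over a row using a vectorized bulk and a scalar tail."""
    n = len(row)
    bulk_end = (n // VEC_WIDTH) * VEC_WIDTH
    total = 0.0
    # Vectorized bulk: each iteration models one VEC_WIDTH-wide load.
    for start in range(0, bulk_end, VEC_WIDTH):
        vec = row[start:start + VEC_WIDTH]
        total += sum(v * v for v in vec)
    # Scalar tail: at most VEC_WIDTH - 1 leftover elements.
    for i in range(bulk_end, n):
        total += row[i] * row[i]
    return total
```

For N=2880 the split is 360 full vectors and no tail, so a row of that width needs no scalar work at all; a width such as 2883 would leave a three-element tail.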

Block reduction simplification

Replaced multi-buffer reduction with a single shared-memory reduction
Reduced synchronization and shared memory traffic
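
The PR page does not show the reduction code, so the following is only a host-side model of what a single shared-memory reduction could look like: each wave reduces its own lanes, writes one partial sum into one shared buffer, and the partials are combined after a single barrier. `WAVE_SIZE` and `block_reduce_sum` are assumptions for this sketch.

```python
WAVE_SIZE = 64  # assumed wavefront width; hypothetical for this sketch


def block_reduce_sum(thread_partials: list) -> float:
    """Model a block-wide sum that uses one shared buffer of per-wave partials."""
    # Step 1: each wave reduces its lanes in registers (models cross-lane shuffles).
    shared = [
        sum(thread_partials[i:i + WAVE_SIZE])
        for i in range(0, len(thread_partials), WAVE_SIZE)
    ]
    # Step 2: after one barrier, the per-wave sums in the shared buffer are combined.
    return sum(shared)
```

With 256 threads and the assumed wave size, the shared buffer holds only four values, which is where the reduction in shared-memory traffic would come from in this model.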

Loop unrolling

Added UNROLL = 2 for medium-width bf16/f16 workloads (N <= 4096)
Improves instruction-level parallelism and reduces loop overhead
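
A minimal sketch of what unrolling by 2 means for the sum-of-squares loop, written in plain Python for illustration. The two independent accumulators are one common way to expose instruction-level parallelism; whether the kernel uses separate accumulators is an assumption, and `sum_of_squares_unrolled` is a hypothetical name.

```python
UNROLL = 2  # unroll factor named in the PR description


def sum_of_squares_unrolled(values: list) -> float:
    """Sum x*x with the loop body unrolled by 2 and two independent accumulators."""
    acc0 = 0.0
    acc1 = 0.0
    n = len(values)
    main_end = n - (n % UNROLL)
    # Unrolled body: two independent multiply-adds per iteration.
    for i in range(0, main_end, UNROLL):
        acc0 += values[i] * values[i]
        acc1 += values[i + 1] * values[i + 1]
    # Remainder: at most UNROLL - 1 elements.
    for i in range(main_end, n):
        acc0 += values[i] * values[i]
    return acc0 + acc1
```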

Bounded input caching (generic path)

Caches x values during the reduction pass to avoid reloading them in the normalization pass
Enabled only for medium-width cases to control register pressure
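
The trade-off behind input caching can be modeled by counting loads. In this sketch, `load(i)` stands in for a global-memory read, the Python list `cache` stands in for registers, and the `eps` default is an assumed value. The PR does not state the width threshold at which caching is enabled, so it is left as a plain flag here.

```python
import math


def rmsnorm_row(load, n, gamma, eps=1e-6, cache_inputs=True):
    """Two-pass RMSNorm over one row; returns (output, number of loads issued)."""
    loads = 0
    cache = []
    sumsq = 0.0
    # Pass 1: reduction. Optionally keep each x for reuse in pass 2.
    for i in range(n):
        x = load(i)
        loads += 1
        if cache_inputs:
            cache.append(x)
        sumsq += x * x
    inv_rms = 1.0 / math.sqrt(sumsq / n + eps)
    # Pass 2: normalization. Reuse cached x or reload it.
    out = []
    for i in range(n):
        if cache_inputs:
            x = cache[i]
        else:
            x = load(i)
            loads += 1
        out.append(x * inv_rms * gamma[i])
    return out, loads
```

Caching halves the number of loads but holds the whole per-thread slice live between the two passes, which is the register-pressure cost the bound is meant to control.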

Register pressure optimization

Removed input caching from fast tiled path
Avoids excessive register usage and potential spills

Test Plan

Unit tests in test_rmsnorm.py in the FlyDSL repo. Run: python -m tests.kernels.test_rmsnorm
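
For readers unfamiliar with the kind of check such a test performs, here is a pure-Python sketch of a reference RMSNorm and an absolute-tolerance comparison. It is not the code in test_rmsnorm.py; the function names and the `eps` default are assumptions, and only the tolerance value 0.02 comes from the log in this PR.

```python
import math


def rmsnorm_reference(x: list, gamma: list, eps: float = 1e-6) -> list:
    """Reference RMSNorm for one row: y_i = x_i / sqrt(mean(x^2) + eps) * gamma_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]


def max_abs_error(actual: list, expected: list) -> float:
    """Largest element-wise absolute difference between two rows."""
    return max(abs(a - e) for a, e in zip(actual, expected))


def passes_atol(actual: list, expected: list, atol: float = 0.02) -> bool:
    """True when every element is within the absolute tolerance."""
    return max_abs_error(actual, expected) <= atol
```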

Test Result

================
Running RMSNorm Tests

Testing RMSNorm (M=32768, N=8192, dtype=bf16)
Launching kernel...
[W424 19:43:23.196708004 collection.cpp:1133] Warning: ROCTracer produced duplicate flow start: 1 (function operator())
Kernel avg time: 0.2125 ms via run_perftest (warmup=10, iters=100)
Bandwidth: 5052.71 GB/s
Max absolute error: 1.56e-02 (atol=0.02)
PASSED

================================================================================
ALL TESTS PASSED
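
The reported numbers can be sanity-checked with a little arithmetic. The sketch below assumes the bandwidth figure counts one read of the input and one write of the output at 2 bytes per bf16 element, with 1 GB = 1e9 bytes; the test's exact formula is not shown on this page. It also notes that the reported maximum error of 1.56e-02 matches, to the printed precision, 2^-6 = 0.015625, the bf16 spacing for values in [2, 4). That would be consistent with a one-ulp difference, though the log does not state the cause. Both function names are hypothetical.

```python
import math


def rmsnorm_bandwidth_gbps(m: int, n: int, bytes_per_elem: int, time_ms: float) -> float:
    """Effective bandwidth, assuming one read of x and one write of y."""
    bytes_moved = 2 * m * n * bytes_per_elem
    return bytes_moved / (time_ms * 1e-3) / 1e9


def bf16_ulp(value: float) -> float:
    """Spacing between adjacent bf16 values near `value` (7 explicit mantissa bits)."""
    _, exp = math.frexp(abs(value))  # abs(value) = m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 8)          # equals 2**(floor(log2(|value|)) - 7)


bandwidth = rmsnorm_bandwidth_gbps(32768, 8192, 2, 0.2125)
```

With the rounded 0.2125 ms average, this gives roughly 5052.9 GB/s, within rounding of the 5052.71 GB/s in the log.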

kudomcho added 3 commits April 24, 2026 19:38

optimized fast path by reducing register use and vectorize on normal …

68894ed

…path

maintain hardcoded warp size as 256

a0ceb27

moved module to top of file

5f4ce9f

kudomcho wants to merge 3 commits into main from khanin/rmsnorm_opt