feat: align quant and fused kernels with Triton in FlyDSL #421

Open
cschenjunlin wants to merge 5 commits into main from quant_cjl
Conversation

@cschenjunlin cschenjunlin commented Apr 21, 2026

Motivation

Align quant and fused kernels with Triton in FlyDSL

Technical Details

Test Plan

Test Result

====================================================================================================
Perf Compare (gpu us): FlyDSL vs AIter
====================================================================================================
op         shape              dtype  FlyDSL(gpu us)  AIter(gpu us)    speedup
rmsnorm_dq 64x256             f32              45.7           66.4      1.45x
rmsnorm_dq 128x1024           f32              43.0           65.2      1.52x
rmsnorm_dq 32x128             f16              43.4           64.1      1.48x
rmsnorm_dq 64x2000            f32              42.0           67.2      1.60x
rmsnorm_dq 16x512             bf16             42.8           63.6      1.49x
rmsnorm_dq 1024x8192          bf16             42.7           65.8      1.54x
rmsnorm_dq 32768x8192         bf16            398.5        1,000.6      2.51x
rmsnorm_sq 64x256             f32              46.5           68.7      1.48x
rmsnorm_sq 128x1024           f32              46.2           70.0      1.52x
rmsnorm_sq 32x128             f16              48.1           70.2      1.46x
rmsnorm_sq 64x2000            f32              47.0           69.8      1.49x
rmsnorm_sq 16x512             bf16             47.3           68.8      1.45x
rmsnorm_sq 1024x8192          bf16             46.6           70.0      1.50x
rmsnorm_sq 32768x8192         bf16            478.2        1,155.5      2.42x
====================================================================================================

Submission Checklist

rmsnorm

  • quant_rms_norm_kernel
  • fused_add_rmsnorm_kernel
  • quant_fused_add_rmsnorm_kernel
  • rmsnorm_kernel_large_m_small_n

layernorm

  • fused_add_layernorm_kernel
  • quant_layernorm_kernel
  • quant_fused_add_layernorm_kernel

@i-chaochen i-chaochen requested a review from coderfeli April 21, 2026 09:10
@coderfeli
Collaborator

@cschenjunlin conflicts.

@cschenjunlin
Author

> @cschenjunlin conflicts.

The conflicts are caused by the import of vector; I have not yet replaced all vector usage in the quant kernels. I will finish the replacement in the next commit.
