Add TopK Gating Softmax Kernel by amd-wsung102 · Pull Request #426 · ROCm/FlyDSL

amd-wsung102 · 2026-04-22T06:30:21Z

Motivation

FlyDSL currently has no TopK Gating Softmax kernel, even though models such as GPT-OSS 120B rely on this operation.

Relevant Files

kernels/topk_gating_softmax_kernel.py - topk gating softmax kernel
tests/kernels/test_topk_gating_softmax.py - unit test for the kernel
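
For readers unfamiliar with the operation, here is a minimal pure-Python reference of top-k gating softmax for a single token. It is only a sketch of the math, not the code in either file above: the function name `topk_gating_softmax_ref`, the `renormalize` flag, and the lower-index-wins tie-breaking rule are all assumptions made for illustration.

```python
import math


def topk_gating_softmax_ref(logits, k, renormalize=False):
    """Return (weights, expert_ids) for the k largest softmax probabilities of one token.

    Hypothetical reference helper; not part of FlyDSL, vLLM, or AITER.
    """
    # Numerically stable softmax over all expert logits.
    row_max = max(logits)
    exps = [math.exp(x - row_max) for x in logits]
    denom = sum(exps)
    probs = [e / denom for e in exps]

    # Pick the k most probable experts; ties go to the lower expert index (assumption).
    expert_ids = sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]
    weights = [probs[i] for i in expert_ids]

    # Optionally rescale the selected weights so they sum to 1 (assumed option).
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights, expert_ids


if __name__ == "__main__":
    w, ids = topk_gating_softmax_ref([0.0, 2.0, 1.0, 3.0], k=2)
    print(ids, w)
```

A reference like this is the kind of function a unit test could compare kernel output against, within a bf16-appropriate tolerance.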

Optimizations

Pack 32 tokens into each block, similar to vLLM's multi-token-per-block style
Every reduction is a sub-warp butterfly shuffle of width 8 (THREADS_PER_TOKEN), eliminating the need for shared memory and barriers
Wide vector loads: Each thread issues one 128-bit BufferCopy that pulls all bf16 experts at once
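
The sub-warp butterfly reduction described above can be simulated on the host in plain Python. This is a sketch under stated assumptions, not the FlyDSL kernel: lanes are modeled as list entries, the shuffle as an index XOR, and the helper names `butterfly_allreduce` and `softmax_via_subwarp` are hypothetical. Only the width of 8 (THREADS_PER_TOKEN) comes from the description.

```python
import math

THREADS_PER_TOKEN = 8  # sub-warp width named in the PR description


def butterfly_allreduce(lane_values, op):
    """Simulate an XOR-butterfly shuffle reduction across a power-of-two group of lanes.

    After log2(width) exchange steps every lane holds the full reduction,
    so no shared memory or barrier is needed to broadcast the result.
    """
    width = len(lane_values)
    assert width > 0 and width & (width - 1) == 0, "width must be a power of two"
    values = list(lane_values)
    offset = width // 2
    while offset >= 1:
        # Each lane i combines its value with the one held by lane (i XOR offset).
        values = [op(values[i], values[i ^ offset]) for i in range(width)]
        offset //= 2
    return values


def softmax_via_subwarp(logits):
    """Softmax of one token's logits, split evenly over THREADS_PER_TOKEN simulated lanes."""
    per_lane = len(logits) // THREADS_PER_TOKEN
    assert per_lane * THREADS_PER_TOKEN == len(logits), "logits must divide evenly"
    chunks = [logits[i * per_lane:(i + 1) * per_lane] for i in range(THREADS_PER_TOKEN)]

    # Each lane reduces its own slice, then the butterfly shares the row max.
    row_max = butterfly_allreduce([max(c) for c in chunks], max)[0]

    # Same pattern for the softmax denominator.
    partial_sums = [sum(math.exp(x - row_max) for x in c) for c in chunks]
    denom = butterfly_allreduce(partial_sums, lambda a, b: a + b)[0]
    return [math.exp(x - row_max) / denom for x in logits]


if __name__ == "__main__":
    print(butterfly_allreduce([3, 1, 4, 1, 5, 9, 2, 6], max))
    print(sum(softmax_via_subwarp([float(i % 5) for i in range(64)])))
```

With a width of 8 the reduction finishes in three exchange steps (offsets 4, 2, 1), which is why several tokens can share one warp without interfering with each other.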

Test Result - Kernel Level

Tested on MI350. Unit test passed.
Measured by median latency, the kernel is about 1.3x-1.55x faster than the vLLM TopKGatingSoftmax kernel, which is used in models like GPT-OSS 120B, and about 1.1x-1.6x faster than the current AITER version.

Kernel	Number of Blocks	Number of Tokens	Median (us)	p99 (us)	Speedup over vLLM	Speedup over AITER
vLLM	1	T ≤ 16	4.96	5.56	baseline
vLLM	2	T = 32	5.4	5.96	baseline
vLLM	4	T = 64	5.56	6.16	baseline
vLLM	8	T = 128	6	7.08	baseline
vLLM	256	T = 4096	6.2	7.28	baseline
vLLM	1024	T = 16384	7.76	9.84	baseline
AITER	1	T = 32	5.58	5.68		baseline
AITER	1	T = 64	5.67	5.84		baseline
AITER	2	T = 128	5.88	5.98		baseline
AITER	64	T = 4096	5.88	6.14		baseline
AITER	256	T = 16384	6.18	6.89		baseline
FlyDSL	1	T ≤ 32	3.48	3.68	1.55	1.60
FlyDSL	2	T = 64	3.92	4.12	1.42	1.45
FlyDSL	4	T = 128	4.12	4.4	1.46	1.43
FlyDSL	128	T = 4096	4.72	4.96	1.31	1.25
FlyDSL	512	T = 16384	5.4	5.64	1.44	1.14
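
The speedup columns are consistent with dividing each baseline's median latency by the FlyDSL median at the same token count, and the FlyDSL block counts are consistent with 32 tokens per block. The short script below reproduces that arithmetic. It is a sketch: the helper names `speedup` and `num_blocks` are hypothetical, and it assumes the FlyDSL "T ≤ 32" row is compared against the T = 32 baseline rows.

```python
import math

TOKENS_PER_BLOCK = 32  # packing factor from the Optimizations list

# Median latencies in microseconds, copied from the kernel-level table.
MEDIAN_US = {
    "vLLM": {32: 5.4, 64: 5.56, 128: 6.0, 4096: 6.2, 16384: 7.76},
    "AITER": {32: 5.58, 64: 5.67, 128: 5.88, 4096: 5.88, 16384: 6.18},
    "FlyDSL": {32: 3.48, 64: 3.92, 128: 4.12, 4096: 4.72, 16384: 5.4},
}


def speedup(baseline, tokens):
    """Ratio of the baseline median to the FlyDSL median at the same token count."""
    return MEDIAN_US[baseline][tokens] / MEDIAN_US["FlyDSL"][tokens]


def num_blocks(tokens):
    """Blocks needed when each block handles TOKENS_PER_BLOCK tokens."""
    return math.ceil(tokens / TOKENS_PER_BLOCK)


if __name__ == "__main__":
    for tokens in sorted(MEDIAN_US["FlyDSL"]):
        print(
            f"T={tokens:>5}  blocks={num_blocks(tokens):>3}  "
            f"vs vLLM={speedup('vLLM', tokens):.2f}x  "
            f"vs AITER={speedup('AITER', tokens):.2f}x"
        )
```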

Test Result - E2E GPT-OSS 120B

Config: 1k8k / TP=8 / BS = 4

Implementation	Kernel	Mean (us)	Median (us)
FlyDSL	topk_gating_softmax_kernel_0	4.64	4.64
vLLM	void vllm::moe::topkGatingSoftmax<std::bfloat16_t, 16, 128, 8, 32, true, 0, (vllm::moe::SharedExpertScoringFunc)0>(std::bfloat16_t const, bool const, float, int, int, int*, int, int, int, int, ...	4.64	4.64

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

coderfeli · 2026-04-23T14:51:40Z

CI failed. @amd-wsung102

amd-wsung102 and others added 4 commits April 9, 2026 18:41

First iteration

c4b1c8e

Pack 32 tokens per block instead of just one token per block, similar to vLLM

a8cd18a

Improved Topk gating performance

8b62bb9

Unit test fix

333a9ea

Merge branch 'main' into topkgatingsoftmax

9c4ed10

amd-wsung102 wants to merge 5 commits into ROCm:main from amd-wsung102:topkgatingsoftmax
