Add vLLM grouped gemm backend for MoE inference by santhnm2 · Pull Request #4566 · NVIDIA/Megatron-LM

santhnm2 · 2026-04-30T22:09:26Z

What does this PR do?

Ports the vLLM grouped gemm kernel to the inference-optimized MoE backend.
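For readers unfamiliar with the term, a grouped GEMM in an MoE layer multiplies each contiguous group of token rows by a different expert's weight matrix in a single call. The sketch below is a pure-Python reference of that computation only. It is not the vLLM kernel or the Megatron-LM backend, and the names `matmul`, `grouped_gemm_reference`, and `tokens_per_expert` are hypothetical, chosen for illustration. It assumes tokens have already been sorted so that rows routed to the same expert are adjacent.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def grouped_gemm_reference(
    tokens: Matrix,
    expert_weights: List[Matrix],
    tokens_per_expert: List[int],
) -> Matrix:
    """Apply expert e's weight to the e-th contiguous group of token rows.

    Assumes `tokens` is sorted by expert and that the group sizes in
    `tokens_per_expert` sum to the number of rows.
    """
    assert sum(tokens_per_expert) == len(tokens)
    out: Matrix = []
    start = 0
    for weight, count in zip(expert_weights, tokens_per_expert):
        if count:  # experts that received no tokens contribute no rows
            out.extend(matmul(tokens[start:start + count], weight))
        start += count
    return out


# Three tokens, three experts; expert 1 receives no tokens.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],  # expert 1: scale by 2 (unused here)
    [[0.0, 1.0], [1.0, 0.0]],  # expert 2: swap the two features
]
out = grouped_gemm_reference(tokens, expert_weights, [1, 0, 2])
```

A real kernel fuses the per-expert loop into one GPU launch so that the group sizes can live on the device; the loop here only shows which rows meet which weights.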

Issue tracking

For PRs from open-source community contributors:

New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

I have added relevant unit tests
I have added relevant functional tests
I have added proper typing to my code (see Typing guidelines)
I have added relevant documentation
I have run the autoformatter.sh on my PR

Code review

Feel free to message or mention @mcore-oncall in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

When your PR is ready, click Ready for Review.
An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
- Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge-conflicts are resolved and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into the `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

- Fix NCCLAllGatherDispatcher.set_step_metadata writing _valid_tokens_tensor to the subclass instead of InferenceAllGatherDispatcherBase, causing the Triton kernel to receive None and crash with 'constexpr has no attr is_ptr'
- Add use_allgather_v support to NCCLAllGatherDispatcher: non-CG steps (prefill) all-gather actual per-rank token counts, pad to max, AllGather, compact on dispatch; expand, ReduceScatter, truncate on combine
- Context passes use_allgather_v=not using_cuda_graph_this_step() for NCCL; both context and wrapper now all-gather actual per-rank counts rather than trivially filling (dummy forward always eager → use_allgather_v=True)
- Fix using_cuda_graph_this_step missing parentheses (property → method call)

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
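The use_allgather_v path described in the commit message above (gather counts, pad to max, AllGather, compact on dispatch; expand, ReduceScatter, truncate on combine) can be sketched as a single-process simulation. This is only an illustration of the data movement under stated assumptions: ranks are modeled as Python lists, tokens as floats, and the function names `allgather_v_dispatch` and `reduce_scatter_v_combine` are hypothetical rather than taken from the PR.

```python
from typing import List, Tuple


def allgather_v_dispatch(
    per_rank_tokens: List[List[float]],
) -> Tuple[List[float], List[int]]:
    """Simulate a variable-length all-gather built on a fixed-size one."""
    # 1. All-gather the actual per-rank token counts.
    counts = [len(t) for t in per_rank_tokens]
    max_count = max(counts)
    # 2. Pad every rank's buffer to the maximum count.
    padded = [t + [0.0] * (max_count - len(t)) for t in per_rank_tokens]
    # 3. Fixed-size AllGather: every rank sees all padded buffers back to back.
    gathered = [tok for buf in padded for tok in buf]
    # 4. Compact: drop the padding rows using the known counts.
    compact: List[float] = []
    for rank, count in enumerate(counts):
        base = rank * max_count
        compact.extend(gathered[base:base + count])
    return compact, counts


def reduce_scatter_v_combine(
    per_rank_partials: List[List[float]], counts: List[int]
) -> List[List[float]]:
    """Simulate the inverse: expand, ReduceScatter, then truncate."""
    max_count = max(counts)
    # 1. Expand each rank's compact partial result back to the padded layout.
    expanded = []
    for partial in per_rank_partials:
        buf: List[float] = []
        offset = 0
        for count in counts:
            buf.extend(partial[offset:offset + count])
            buf.extend([0.0] * (max_count - count))
            offset += count
        expanded.append(buf)
    # 2. ReduceScatter: sum across ranks; rank r keeps padded slice r.
    # 3. Truncate each slice back to that rank's real token count.
    results = []
    for rank, count in enumerate(counts):
        base = rank * max_count
        summed = [sum(buf[base + i] for buf in expanded) for i in range(max_count)]
        results.append(summed[:count])
    return results


# Three ranks holding 3, 1, and 2 tokens.
compact, counts = allgather_v_dispatch([[1.0, 2.0, 3.0], [10.0], [20.0, 21.0]])
# Pretend rank r's local experts scale every gathered token by (r + 1).
partials = [[x * (r + 1) for x in compact] for r in range(3)]
combined = reduce_scatter_v_combine(partials, counts)
```

Because the padded size depends on the actual counts, buffer shapes change from step to step, which is consistent with the commit message restricting this path to non-CUDA-graph (eager) steps.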

santhnm2 · 2026-05-01T06:45:46Z

/ok to test c49de0c

santhnm2 · 2026-05-01T16:55:03Z

/ok to test f04ad51

Addressed comments

santhnm2 · 2026-05-01T21:57:24Z

/ok to test 4a034a8

santhnm2 · 2026-05-01T22:35:54Z

/ok to test e7840c9

sidsingh-nvidia added 30 commits April 9, 2026 10:31

add agv

9e6c392

add allgatherv

d49c124

add reducescatterv

5617c93

test dispatch combine loop

5fc4a2d

make test more robust

164f24d

merge

dc2ff5f

refactor, remove a2a dispatcher in inference

15f3bfe

Merge branch 'main' into siddharth/all-gather-v-dispatcher

83a1393

latest

b1ca2d1

refactor inference MoE token dispatcher and fix NCCL prefill path

c7e1ce7

more refactoring

00c3916

make flashinfer work

4e5c725

fix transformer config

25b4f59

simplify dummy forward pass

e8e2309

fused allgatherv

f813ff0

remove design file

92f73cf

remove files

ac4fd07

remove test file

05911a6

minor

6e152e6

fix docstring

38b9bd8

simplify context

094b16f

Merge branch 'main' into siddharth/all-gather-v-dispatcher

5255163

cleanup

9e70e01

more cleanup

798d013

add comment

be07644

fix for MTP

c262f25

remove dead argument

43ee48d

optimize permute kernel for mxfp8

90a4121

minor

4e374e0

santhnm2 added 8 commits April 30, 2026 22:46

Remove histogram

bb605d3

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Add TODO comment

2846aec

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Merge remote-tracking branch 'upstream/main' into vllm-grouped-gemm-c…

3dc4d0e

…uda-graphable

Fix return dtype

017f668

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Fix reference dtype

1b00626

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

More dtype fixes

aec1aa0

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

More dtype fixes

b9b9e46

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

Linting

c49de0c

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

copy-pr-bot Bot temporarily deployed to test May 1, 2026 06:46 Inactive

kvareddy approved these changes May 1, 2026

Fix typo

f04ad51

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

copy-pr-bot Bot temporarily deployed to test May 1, 2026 16:56 Inactive

Victarry approved these changes May 1, 2026

svcnvidia-nemo-ci added the "Final Review" label (PR is in the "final review" stage) May 1, 2026

santhnm2 requested a review from sidsingh-nvidia May 1, 2026 17:23

ericharper approved these changes May 1, 2026

svcnvidia-nemo-ci added the "Approved" label (All necessary approvals have been made) and removed the "Final Review" label (PR is in the "final review" stage) May 1, 2026

shanmugamr1992 approved these changes May 1, 2026

santhnm2 enabled auto-merge May 1, 2026 21:48

Merge branch 'main' into vllm-grouped-gemm-cuda-graphable

4a034a8

copy-pr-bot Bot temporarily deployed to test May 1, 2026 21:58 Inactive

santhnm2 disabled auto-merge May 1, 2026 22:03

fp32 rsv buffer

e7840c9

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>

copy-pr-bot Bot temporarily deployed to test May 1, 2026 22:37 Inactive

santhnm2 wants to merge 98 commits into NVIDIA:main from santhnm2:vllm-grouped-gemm-cuda-graphable

7 participants
