Add vLLM grouped gemm backend for MoE inference #4566

Open
santhnm2 wants to merge 98 commits into NVIDIA:main from santhnm2:vllm-grouped-gemm-cuda-graphable

Conversation

@santhnm2
Contributor

What does this PR do?

Ports the vLLM grouped gemm kernel to the inference-optimized MoE backend.
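
For context, a grouped GEMM in an MoE layer generally means that each expert's contiguous slice of tokens is multiplied by that expert's own weight matrix, with all groups handled in a single kernel launch. The pure-Python sketch below shows only that reference semantics. It is not the vLLM kernel or the code in this PR; `grouped_gemm_reference`, `matmul`, and the argument names are hypothetical, and the tokens are assumed to be already sorted by expert.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain (rows x k) @ (k x cols) matrix product."""
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]
        for row in a
    ]


def grouped_gemm_reference(
    tokens: Matrix,
    expert_weights: List[Matrix],
    tokens_per_expert: List[int],
) -> Matrix:
    """Multiply each expert's slice of `tokens` by that expert's weight.

    Assumes `tokens` is sorted so that expert 0's rows come first,
    then expert 1's rows, and so on.
    """
    assert sum(tokens_per_expert) == len(tokens)
    out: Matrix = []
    start = 0
    for weight, count in zip(expert_weights, tokens_per_expert):
        # An expert with zero tokens contributes nothing.
        out.extend(matmul(tokens[start:start + count], weight))
        start += count
    return out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [
    [[1.0, 2.0], [3.0, 4.0]],    # expert 0
    [[10.0, 0.0], [0.0, 10.0]],  # expert 1
]
result = grouped_gemm_reference(tokens, weights, tokens_per_expert=[2, 1])
```

A real kernel fuses the per-expert loop into one launch; the loop here exists only to make the grouping explicit.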

Issue tracking

For PRs from open-source community contributors:

  • New features: a linked issue is required. Please open a feature request and reference it here before submitting the PR.
  • Small updates (bug fixes, minor improvements): a linked issue is recommended and will accelerate the PR review process.

Linked issue:

Contribution process

Pre-checks

  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run autoformatter.sh on my PR

Code review

Feel free to message @mcore-oncall or tag them in a comment to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!

All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.

Step 1: Mark PR as "Ready for Review"

  1. When your PR is ready, click Ready for Review.
  2. An oncall reviewer is auto-assigned and expert reviewers are notified based on your changes.
    • Some PRs may jump straight to step 2. This is determined by .github/CODEOWNERS.

⚠️ Only mark as ready once merge conflicts are resolved and CI is passing.
Final Review may be declined if these requirements are not met.

Step 2: Final Review

For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved

Once all required reviewers have approved, the Approved label is applied automatically.

Merge

Any member of mcore-engineers will be able to merge your PR.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

- Fix NCCLAllGatherDispatcher.set_step_metadata writing _valid_tokens_tensor
  to the subclass instead of InferenceAllGatherDispatcherBase, causing the
  Triton kernel to receive None and crash with 'constexpr has no attr is_ptr'
- Add use_allgather_v support to NCCLAllGatherDispatcher: non-CG steps
  (prefill) all-gather actual per-rank token counts, pad to max, AllGather,
  compact on dispatch; expand, ReduceScatter, truncate on combine
- Context passes use_allgather_v=not using_cuda_graph_this_step() for NCCL;
  both context and wrapper now all-gather actual per-rank counts rather than
  trivially filling (dummy forward always eager → use_allgather_v=True)
- Fix using_cuda_graph_this_step missing parentheses (property → method call)
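
The missing-parentheses bug in the last bullet is easy to reproduce in isolation. In the sketch below, `Context` is a hypothetical stand-in class (not the PR's actual context class) that exposes `using_cuda_graph_this_step` as a method. Referencing the method without calling it yields a bound method object, which is always truthy, so `not ...` is always `False` regardless of the real state.

```python
class Context:
    """Hypothetical stand-in for an inference context."""

    def __init__(self, cuda_graph: bool) -> None:
        self._cuda_graph = cuda_graph

    def using_cuda_graph_this_step(self) -> bool:
        return self._cuda_graph


ctx = Context(cuda_graph=False)  # an eager (non-CUDA-graph) step

# Buggy: the bound method object itself is truthy, so this is always False.
buggy_use_allgather_v = not ctx.using_cuda_graph_this_step

# Fixed: calling the method returns the actual flag.
fixed_use_allgather_v = not ctx.using_cuda_graph_this_step()
```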
Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
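
The `use_allgather_v` flow described in the commit message above (gather per-rank counts, pad to max, AllGather, compact; then expand, ReduceScatter, truncate) can be illustrated with a single-process simulation. This is a sketch of one reading of that message, not the PR's implementation: it uses plain Python lists instead of NCCL, one scalar per token, and the hypothetical names `dispatch` and `combine`.

```python
from typing import List, Tuple


def dispatch(per_rank_tokens: List[List[float]]) -> Tuple[List[float], List[int]]:
    """Simulate the dispatch side; returns the buffer every rank ends up with."""
    counts = [len(t) for t in per_rank_tokens]  # all-gather of per-rank counts
    max_count = max(counts)
    # Pad each rank's tokens to the max so the AllGather has a fixed size.
    padded = [t + [0.0] * (max_count - len(t)) for t in per_rank_tokens]
    gathered = [x for buf in padded for x in buf]  # AllGather result
    # Compact: drop the padding slots using the known counts.
    compact = [
        gathered[r * max_count + i]
        for r, c in enumerate(counts)
        for i in range(c)
    ]
    return compact, counts


def combine(per_rank_outputs: List[List[float]], counts: List[int]) -> List[List[float]]:
    """Simulate the combine side; returns each rank's own slice of the sum."""
    max_count = max(counts)
    expanded = []
    for out in per_rank_outputs:
        # Expand: re-insert padding so every rank's chunk has max_count slots.
        buf, pos = [], 0
        for c in counts:
            buf.extend(out[pos:pos + c] + [0.0] * (max_count - c))
            pos += c
        expanded.append(buf)
    # ReduceScatter: sum across ranks, then rank r keeps chunk r.
    summed = [sum(vals) for vals in zip(*expanded)]
    # Truncate: keep only the real tokens in each rank's chunk.
    return [
        summed[r * max_count : r * max_count + counts[r]]
        for r in range(len(counts))
    ]


compact, counts = dispatch([[1.0, 2.0, 3.0], [4.0]])
# Each rank produces a partial result for every global token.
partials = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
combined = combine(partials, counts)
```

Padding to the maximum count keeps the collective's buffer size uniform across ranks, at the cost of moving some unused slots.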
@santhnm2
Contributor Author

santhnm2 commented May 1, 2026

/ok to test c49de0c

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@santhnm2
Contributor Author

santhnm2 commented May 1, 2026

/ok to test f04ad51

@svcnvidia-nemo-ci added the "Final Review" label (PR is in the "final review" stage) on May 1, 2026
@santhnm2 requested a review from sidsingh-nvidia on May 1, 2026 17:23
@santhnm2 dismissed sidsingh-nvidia’s stale review on May 1, 2026 17:24

Addressed comments

@svcnvidia-nemo-ci added the "Approved" label (All necessary approvals have been made) and removed the "Final Review" label (PR is in the "final review" stage) on May 1, 2026
@santhnm2 enabled auto-merge on May 1, 2026 21:48
@santhnm2
Contributor Author

santhnm2 commented May 1, 2026

/ok to test 4a034a8

Signed-off-by: Keshav Santhanam <ksanthanam@nvidia.com>
@santhnm2
Contributor Author

santhnm2 commented May 1, 2026

/ok to test e7840c9

Labels

  • Approved (All necessary approvals have been made)
  • complexity: high
  • Run functional tests

7 participants