[WebGPU] QKV and MLP fusions for Qwen3 #28280
Open
hariharans29 wants to merge 25 commits into main from
Pull request overview
This PR adds WebGPU-focused fused operators and optimizer passes for decoder-style MatMulNBits patterns (MLP gate/up and QKV projections), along with tests and a microbenchmark to evaluate decode performance/correctness.
Changes:
- Introduces new contrib ops MatMulNBitsMlp and MatMulNBitsQkv (schemas + WebGPU kernels + WGSL templates).
- Adds graph transformers MatMulNBitsMlpFusion / MatMulNBitsQkvFusion and corresponding optimizer tests.
- Improves WebGPU runtime support (graph-capture buffer manager activation, queue-idle wait helper, better shader compilation diagnostics) and adds a decode microbenchmark.
Reviewed changes
Copilot reviewed 33 out of 33 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| onnxruntime/test/optimizer/matmul_nbits_qkv_fusion_test.cc | New unit tests validating QKV fusion and output contracts on WebGPU. |
| onnxruntime/test/optimizer/matmul_nbits_mlp_fusion_test.cc | New unit tests validating MLP fusion (simplified/skip + passthrough) on WebGPU. |
| onnxruntime/test/optimizer/graph_transform_utils_test.cc | Minor formatting-only tweak (blank line). |
| onnxruntime/test/onnx/microbenchmark/webgpu_matmul_nbits_decode.cc | New benchmark harness for fused/unfused decode paths on WebGPU. |
| onnxruntime/test/onnx/microbenchmark/main.cc | Adjusts benchmark env logging severity. |
| onnxruntime/core/session/ort_version_check.h | Makes version parsing consteval-friendly with a macro fallback. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.h | Tracks when graph-capture buffer manager is active. |
| onnxruntime/core/providers/webgpu/webgpu_execution_provider.cc | Lazily creates/activates graph buffer manager for capture; allocator uses dynamic buffer manager getter. |
| onnxruntime/core/providers/webgpu/webgpu_context.h | Adds WaitForQueueIdle() declaration. |
| onnxruntime/core/providers/webgpu/webgpu_context.cc | Implements WaitForQueueIdle() using OnSubmittedWorkDone. |
| onnxruntime/core/providers/webgpu/program_manager.cc | Enhances pipeline build failures with shader compilation diagnostics. |
| onnxruntime/core/providers/webgpu/compute_context.h | Adds FlushAndWait() convenience for flushing + waiting on queue idle. |
| onnxruntime/core/providers/webgpu/allocator.h | Adds allocator ctor that accepts a buffer-manager getter function. |
| onnxruntime/core/providers/webgpu/allocator.cc | Implements getter-based allocator to support switching buffer managers. |
| onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.h | New transformer declaration for QKV fusion. |
| onnxruntime/core/optimizer/matmul_nbits_qkv_fusion.cc | New transformer implementation for QKV fusion. |
| onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.h | New transformer declaration for MLP fusion. |
| onnxruntime/core/optimizer/matmul_nbits_mlp_fusion.cc | New transformer implementation for MLP fusion. |
| onnxruntime/core/optimizer/graph_transformer_utils.cc | Registers the new fusion transformers. |
| onnxruntime/core/graph/contrib_ops/contrib_defs.cc | Adds contrib operator schemas/docs for MatMulNBitsMlp and MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/webgpu_contrib_kernels.cc | Registers WebGPU kernels for the new fused ops. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.wgsl.template | New WGSL template implementing fused QKV decode kernel. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.h | New WebGPU kernel wrapper for MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_qkv.cc | New WebGPU kernel implementation for MatMulNBitsQkv. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp_wide_tile_m1.wgsl.template | New WGSL template for an MLP wide-tile variant. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.wgsl.template | New WGSL template implementing fused MLP (optionally with norm/skip/passthrough). |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.h | New WebGPU kernel wrapper for MatMulNBitsMlp. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_mlp.cc | New WebGPU kernel implementation for MatMulNBitsMlp. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.h | Adds declarations for “would apply” dispatch-selection helpers and shared constants. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits_common.cc | Implements the new dispatch-selection helpers. |
| onnxruntime/contrib_ops/webgpu/quantization/matmul_nbits.cc | Refactors path selection to use the new “would apply” helpers. |
| onnxruntime/contrib_ops/webgpu/quantization/dp4a_matmul_mlp.wgsl.template | Adds WGSL template for DP4A MLP path. |
| cmake/onnxruntime_unittests.cmake | Wires the new WebGPU decode benchmark into the benchmark target sources. |
…shader diagnostics

These changes are kept on hari/webgpu_perf_1_full locally. The lazy buffer-mgr fix is being submitted as a separate PR (branch hari/webgpu_graph_capture_buffer_fix) because it is an independent correctness fix for a pre-existing latent bug, exposed but not introduced by these fusions.
This template file was added speculatively but is not referenced by any kernel, include, or build rule. Removing to keep the PR clean.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
The shared-EP path through TransformerTester triggers a SEH 0xC0000005 in CI when the EP outlives a per-session profiler whose pointer is still cached on the EP. A separate fix to the WebGPU EP's session_profiler_ lifetime is in flight; meanwhile, switch the 8 MatMulNBits MLP and QKV WebGPU fusion-vs-unfused tests to a small RunWebGpuFusionTransformerTest helper that creates a fresh execution provider per session via a factory lambda. Production code is unchanged.
Description
Summary
Adds two WebGPU-only graph fusions and the contrib ops they target, plus a small refactor of the existing MatMulNBits dispatch logic so the new fused kernels can share its predicates.
- MatMulNBitsMlp op + kernel: contrib_ops/webgpu/quantization/matmul_nbits_mlp.{cc,h}, *.wgsl.template (3). Fuses (Skip)SimplifiedLayerNormalization + two MatMulNBits projections (gate, up) + optional biases + Sigmoid/Mul (SiLU) + element-wise Mul. Single dispatch instead of 5–7.
- MatMulNBitsQkv op + kernel: contrib_ops/webgpu/quantization/matmul_nbits_qkv.{cc,h}, *.wgsl.template. Fuses (Skip)SimplifiedLayerNormalization + three MatMulNBits projections (Q, K, V) sharing the same input. Single dispatch instead of 4.
- core/graph/contrib_ops/contrib_defs.cc: MatMulNBitsMlp and MatMulNBitsQkv contrib op schemas (kMSDomain, opset 1).
- core/optimizer/matmul_nbits_{mlp,qkv}_fusion.{cc,h}: fusion transformers, registered in graph_transformer_utils.cc.
- contrib_ops/webgpu/quantization/matmul_nbits_common.{cc,h} + matmul_nbits.cc: shared path-selection predicates for the MatMulNBits path.
- test/optimizer/matmul_nbits_{mlp,qkv}_fusion_test.cc, graph_transform_utils_test.cc: unit tests for the new fusions.

Motivation and Context
~25–30% decode TPS improvement on the WebGPU + D3D backend on Windows. GPU used: RTX 5060 Ti, model: Qwen3-1.7B.
BEFORE (95 decode TPS): main branch
[screenshot: benchmark output on the main branch]

AFTER (120+ decode TPS): PR branch
[screenshot: benchmark output on the PR branch]