
feat(qmoe): support 2-bit expert weights in CPU kernel#28336

Open
Rishi-Dave wants to merge 1 commit into microsoft:main from Rishi-Dave:rishidave/feat/qmoe-2bit-weights

Conversation

@Rishi-Dave
Contributor

Summary

  • Extend the CPU QMoE contrib op to accept expert_weight_bits=2, unpacking four 2-bit values per byte LSB-first to match the convention used by MatMulNBits.
  • Generalize the zero-point pack size formula to 8 / num_bits so 2-bit zero points pack four per byte, while preserving existing 4-bit and 8-bit behavior (both conventions are sketched just after this list).
  • Update the schema attribute description to list the valid values (2, 4, 8).
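
A minimal Python sketch of the LSB-first unpacking convention and the generalized pack-size formula described above (illustration only, with hypothetical helper names; the actual kernel change is C++ inside DequantizeBlockWithMlas):

```python
# Sketch of the LSB-first packing convention, assuming the layout described above.
def unpack_value(packed_bytes, c, num_bits):
    """Return the c-th quantized value from an LSB-first packed byte stream."""
    per_byte = 8 // num_bits            # 4 values per byte for 2-bit, 2 for 4-bit, 1 for 8-bit
    byte = packed_bytes[c // per_byte]
    shift = (c % per_byte) * num_bits   # LSB-first: element 0 occupies the low bits
    mask = (1 << num_bits) - 1          # 0x3 / 0xF / 0xFF
    return (byte >> shift) & mask       # for num_bits=2 this is (byte >> ((c % 4) * 2)) & 0x3

# Zero points pack the same way: 8 // num_bits values per byte,
# i.e. 4 / 2 / 1 for 2- / 4- / 8-bit, unchanged for the existing 4- and 8-bit paths.
def zp_pack_size(num_bits):
    return 8 // num_bits
```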

Motivation

Quark-quantized 2-bit Mixture-of-Experts models cannot currently run inference in ORT because the CPU kernel hard-rejects expert_weight_bits values other than 4 or 8. This change closes that gap on the CPU path.

Fixes #28163

Changes

  • onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc
    • Loosen the expert_weight_bits validation to accept 2, 4, or 8.
    • Add a 2-bit unpack branch ((byte >> ((c % 4) * 2)) & 0x3) alongside the existing 4-bit branch in both the row-wise and block-wise asymmetric paths inside DequantizeBlockWithMlas.
    • Generalize zp_pack_size to 8 / num_bits (yields 4 / 2 / 1 for 2 / 4 / 8 bit; behaviorally identical for 4 and 8 bit).
  • onnxruntime/core/graph/contrib_ops/contrib_defs.cc
    • Update the expert_weight_bits attribute description to enumerate the valid values.
  • onnxruntime/test/python/transformers/test_qmoe_cpu.py
    • Generalize the row-wise and block-wise quant helpers to take a bits parameter (defaulting to 4 to preserve existing tests) and support 2-bit packing; a rough sketch of such a helper follows this list.
    • Add 2-bit row-wise and block-wise CPU QMoE test cases.
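
As a rough illustration of that generalization (hypothetical helper name and numpy-based sketch; the real helpers in test_qmoe_cpu.py may handle scales and zero points differently):

```python
import numpy as np

def quant_pack_rowwise(w, bits=4):
    """Hypothetical bits-parameterized row-wise quantize-and-pack helper.

    Quantizes each row of w to unsigned bits-wide codes around a mid-range
    zero point, then packs 8 // bits codes per byte, LSB-first.
    """
    levels = (1 << bits) - 1                        # 3, 15, or 255
    zero_point = 1 << (bits - 1)                    # 2, 8, or 128
    scale = np.abs(w).max(axis=-1, keepdims=True) / max(zero_point - 1, 1)
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale) + zero_point, 0, levels).astype(np.uint8)

    per_byte = 8 // bits                            # 4 / 2 / 1 codes per byte
    q = q.reshape(*q.shape[:-1], -1, per_byte)      # assumes row length is a multiple of per_byte
    shifts = np.arange(per_byte, dtype=np.uint8) * bits
    packed = (q << shifts).sum(axis=-1).astype(np.uint8)  # LSB-first packing
    return packed, scale.astype(np.float32)
```

Unpacking these codes with the expression from the Summary sketch round-trips them, which is what lets the 2-bit CPU path be compared against a float reference.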

This intentionally leaves the MLAS LUT GEMM fast path for 2-bit out of scope — the dequant + MlasGemm path is sufficient for correctness, and a follow-up can add the LUT fast path once the surface here is settled.

Test Plan

  • Existing 4-bit and 8-bit row-wise / block-wise CPU tests in test_qmoe_cpu.py continue to pass (the helpers default to bits=4).
  • New 2-bit row-wise and block-wise tests construct small QMoE models, run the CPU kernel with expert_weight_bits=2, and check that the output stays within a loose tolerance of the float reference. The tolerance is widened for 2-bit because the format is intrinsically lossy (a sketch of this kind of check follows the list).
  • lintrunner is clean on the changed files.
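
For context, a hedged sketch of such a round-trip/tolerance check, with hypothetical tolerance values and a dequant helper mirroring the packing sketch above (the real thresholds and helpers live in test_qmoe_cpu.py):

```python
import numpy as np

# Hypothetical per-bit-width tolerances: 2-bit is intrinsically lossy, so its
# bound is much looser. Example values only, not the real test thresholds.
ATOL = {2: 5e-1, 4: 5e-2, 8: 1e-2}

def dequant_rowwise(packed, scale, bits):
    """Inverse of the packing sketch above: LSB-first unpack, then rescale."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    shifts = np.arange(per_byte, dtype=np.uint8) * bits
    q = (packed[..., None] >> shifts) & mask              # (..., K // per_byte, per_byte)
    q = q.reshape(*q.shape[:-2], -1).astype(np.float32)   # back to (..., K)
    return (q - (1 << (bits - 1))) * scale

# e.g. np.testing.assert_allclose(cpu_output, float_reference, atol=ATOL[bits])
```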

Local CPU build/test execution was constrained by the dev environment (no prebuilt onnxruntime present and no incremental build dir to reuse), so the new tests are exercised via CI on this PR.

Extend the CPU QMoE contrib op to accept expert_weight_bits=2.
Weights are unpacked as four LSB-first 2-bit values per byte,
matching the convention used by MatMulNBits.

The zero-point pack size formula is generalized to 8 / num_bits
so 2-bit zero points pack four-per-byte, while 4-bit and 8-bit
behavior is preserved.

The schema attribute description is updated to list the valid
values (2, 4, 8). The Python parity tests generalize the
row-wise and block-wise quant helpers to take a bits parameter
and add new 2-bit test cases.

Fixes microsoft#28163
@tianleiwu
Contributor

This PR overlaps with #28185.
