
feat(qmoe): support 2-bit expert weights in CPU kernel#28336

Open
Rishi-Dave wants to merge 1 commit into microsoft:main from Rishi-Dave:rishidave/feat/qmoe-2bit-weights

Conversation

@Rishi-Dave
Contributor

Summary

  • Extend the CPU QMoE contrib op to accept expert_weight_bits=2, unpacking four 2-bit values per byte LSB-first to match the convention used by MatMulNBits.
  • Generalize the zero-point pack size formula to 8 / num_bits so 2-bit zero points pack four per byte, while preserving existing 4-bit and 8-bit behavior (both conventions are sketched just after this list).
  • Update the schema attribute description to list the valid values (2, 4, 8).
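
A minimal Python sketch of the LSB-first unpacking convention and the generalized pack-size formula described above (illustration only, with hypothetical helper names; the actual kernel change is C++ inside DequantizeBlockWithMlas):

```python
# Sketch of the LSB-first packing convention, assuming the layout described above.
def unpack_value(packed_bytes, c, num_bits):
    """Return the c-th quantized value from an LSB-first packed byte stream."""
    per_byte = 8 // num_bits            # 4 values per byte for 2-bit, 2 for 4-bit, 1 for 8-bit
    byte = packed_bytes[c // per_byte]
    shift = (c % per_byte) * num_bits   # LSB-first: element 0 occupies the low bits
    mask = (1 << num_bits) - 1          # 0x3 / 0xF / 0xFF
    return (byte >> shift) & mask       # for num_bits=2 this is (byte >> ((c % 4) * 2)) & 0x3

# Zero points pack the same way: 8 // num_bits values per byte,
# i.e. 4 / 2 / 1 for 2- / 4- / 8-bit, unchanged for the existing 4- and 8-bit paths.
def zp_pack_size(num_bits):
    return 8 // num_bits
```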

Motivation

Quark-quantized 2-bit Mixture-of-Experts models cannot currently run inference in ORT because the CPU kernel hard-rejects expert_weight_bits values other than 4 or 8. This change closes that gap on the CPU path.

Fixes #28163

Changes

  • onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc
    • Loosen the expert_weight_bits validation to accept 2, 4, or 8.
    • Add a 2-bit unpack branch ((byte >> ((c % 4) * 2)) & 0x3) alongside the existing 4-bit branch in both the row-wise and block-wise asymmetric paths inside DequantizeBlockWithMlas.
    • Generalize zp_pack_size to 8 / num_bits (yields 4 / 2 / 1 for 2 / 4 / 8 bit; behaviorally identical for 4 and 8 bit).
  • onnxruntime/core/graph/contrib_ops/contrib_defs.cc
    • Update the expert_weight_bits attribute description to enumerate the valid values.
  • onnxruntime/test/python/transformers/test_qmoe_cpu.py
    • Generalize the row-wise and block-wise quant helpers to take a bits parameter (defaulting to 4 to preserve existing tests) and support 2-bit packing; a rough sketch of such a helper follows this list.
    • Add 2-bit row-wise and block-wise CPU QMoE test cases.
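
As a rough illustration of that generalization (hypothetical helper name and numpy-based sketch; the real helpers in test_qmoe_cpu.py may handle scales and zero points differently):

```python
import numpy as np

def quant_pack_rowwise(w, bits=4):
    """Hypothetical bits-parameterized row-wise quantize-and-pack helper.

    Quantizes each row of w to unsigned bits-wide codes around a mid-range
    zero point, then packs 8 // bits codes per byte, LSB-first.
    """
    levels = (1 << bits) - 1                        # 3, 15, or 255
    zero_point = 1 << (bits - 1)                    # 2, 8, or 128
    scale = np.abs(w).max(axis=-1, keepdims=True) / max(zero_point - 1, 1)
    scale[scale == 0] = 1.0
    q = np.clip(np.round(w / scale) + zero_point, 0, levels).astype(np.uint8)

    per_byte = 8 // bits                            # 4 / 2 / 1 codes per byte
    q = q.reshape(*q.shape[:-1], -1, per_byte)      # assumes row length is a multiple of per_byte
    shifts = np.arange(per_byte, dtype=np.uint8) * bits
    packed = (q << shifts).sum(axis=-1).astype(np.uint8)  # LSB-first packing
    return packed, scale.astype(np.float32)
```

Unpacking these codes with the expression from the Summary sketch round-trips them, which is what lets the 2-bit CPU path be compared against a float reference.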

This intentionally leaves the MLAS LUT GEMM fast path for 2-bit out of scope — the dequant + MlasGemm path is sufficient for correctness, and a follow-up can add the LUT fast path once the surface here is settled.

Test Plan

  • Existing 4-bit and 8-bit row-wise / block-wise CPU tests in test_qmoe_cpu.py continue to pass (the helpers default to bits=4).
  • New 2-bit row-wise and block-wise tests construct small QMoE models, run the CPU kernel with expert_weight_bits=2, and check that the output stays within a loose tolerance of the float reference. The tolerance is widened for 2-bit because the format is intrinsically lossy (a sketch of this kind of check follows the list).
  • lintrunner is clean on the changed files.
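
For context, a hedged sketch of such a round-trip/tolerance check, with hypothetical tolerance values and a dequant helper mirroring the packing sketch above (the real thresholds and helpers live in test_qmoe_cpu.py):

```python
import numpy as np

# Hypothetical per-bit-width tolerances: 2-bit is intrinsically lossy, so its
# bound is much looser. Example values only, not the real test thresholds.
ATOL = {2: 5e-1, 4: 5e-2, 8: 1e-2}

def dequant_rowwise(packed, scale, bits):
    """Inverse of the packing sketch above: LSB-first unpack, then rescale."""
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    shifts = np.arange(per_byte, dtype=np.uint8) * bits
    q = (packed[..., None] >> shifts) & mask              # (..., K // per_byte, per_byte)
    q = q.reshape(*q.shape[:-2], -1).astype(np.float32)   # back to (..., K)
    return (q - (1 << (bits - 1))) * scale

# e.g. np.testing.assert_allclose(cpu_output, float_reference, atol=ATOL[bits])
```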

Local CPU build/test execution was constrained by the dev environment (no prebuilt onnxruntime present and no incremental build dir to reuse), so the new tests are exercised via CI on this PR.

Extend the CPU QMoE contrib op to accept expert_weight_bits=2.
Weights are unpacked as four LSB-first 2-bit values per byte,
matching the convention used by MatMulNBits.

The zero-point pack size formula is generalized to 8 / num_bits
so 2-bit zero points pack four-per-byte, while 4-bit and 8-bit
behavior is preserved.

The schema attribute description is updated to list the valid
values (2, 4, 8). The Python parity tests generalize the
row-wise and block-wise quant helpers to take a bits parameter
and add new 2-bit test cases.

Fixes microsoft#28163
@tianleiwu
Contributor

This PR overlaps with #28185.
