feat(qmoe): support 2-bit expert weights in CPU kernel #28336
Open
Rishi-Dave wants to merge 1 commit into microsoft:main from
Conversation
Extend the CPU QMoE contrib op to accept expert_weight_bits=2. Weights are unpacked as four LSB-first 2-bit values per byte, matching the convention used by MatMulNBits. The zero-point pack size formula is generalized to 8 / num_bits so 2-bit zero points pack four-per-byte, while 4-bit and 8-bit behavior is preserved. The schema attribute description is updated to list the valid values (2, 4, 8). Python parity tests are extended with a 2-bit generalization of the row-wise and block-wise quant helpers and new 2-bit test cases. Fixes microsoft#28163
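The LSB-first packing convention described above can be sketched in a few lines of NumPy. This is an illustrative mirror of the kernel's unpack expression and the generalized zero-point pack size, not code from the PR; the function names here are hypothetical.

```python
import numpy as np

# Illustrative sketch of the LSB-first 2-bit convention described above;
# function names are hypothetical, not taken from the PR.
def pack_2bit_lsb_first(values: np.ndarray) -> np.ndarray:
    """Pack values in 0..3 four-per-byte, first value in the lowest bits."""
    v = values.reshape(-1, 4).astype(np.uint8)
    return (v[:, 0] | (v[:, 1] << 2) | (v[:, 2] << 4) | (v[:, 3] << 6)).astype(np.uint8)

def unpack_2bit(packed: np.ndarray, c: int) -> int:
    """Recover column c with the same expression the kernel uses:
    (byte >> ((c % 4) * 2)) & 0x3."""
    byte = int(packed[c // 4])
    return (byte >> ((c % 4) * 2)) & 0x3

def zp_pack_size(num_bits: int) -> int:
    """Zero points packed per byte: 8 / num_bits -> 4, 2, 1 for 2-, 4-, 8-bit."""
    return 8 // num_bits
```

For example, packing `[3, 0, 1, 2]` yields the single byte `0x93`, and `unpack_2bit` recovers each value in order.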
Contributor
The PR has overlap with #28185.
Summary
Extend the CPU `QMoE` contrib op to accept `expert_weight_bits=2`, unpacking four 2-bit values per byte LSB-first to match the convention used by `MatMulNBits`. The zero-point pack size is generalized to `8 / num_bits` so 2-bit zero points pack four-per-byte, while preserving existing 4-bit and 8-bit behavior.

Motivation
Quark-quantized 2-bit Mixture-of-Experts models cannot currently run inference in ORT because the CPU kernel hard-rejects `expert_weight_bits` values other than 4 or 8. This change closes that gap on the CPU path.

Fixes #28163
Changes
`onnxruntime/contrib_ops/cpu/moe/moe_quantization_cpu.cc`
- Extend `expert_weight_bits` validation to accept 2, 4, or 8.
- Add a 2-bit unpack branch (`(byte >> ((c % 4) * 2)) & 0x3`) alongside the existing 4-bit branch in both the row-wise and block-wise asymmetric paths inside `DequantizeBlockWithMlas`.
- Generalize `zp_pack_size` to `8 / num_bits` (yields 4 / 2 / 1 for 2 / 4 / 8 bit; behaviorally identical for 4 and 8 bit).

`onnxruntime/core/graph/contrib_ops/contrib_defs.cc`
- Update the `expert_weight_bits` attribute description to enumerate the valid values.

`onnxruntime/test/python/transformers/test_qmoe_cpu.py`
- Generalize the row-wise and block-wise quant helpers with a `bits` parameter (defaults to 4 to preserve existing tests) and support 2-bit packing.

This intentionally leaves the MLAS LUT GEMM fast path for 2-bit out of scope: the dequant + `MlasGemm` path is sufficient for correctness, and a follow-up can add the LUT fast path once the surface here is settled.

Test Plan
- Existing parity tests in `test_qmoe_cpu.py` continue to pass (the helpers default to `bits=4`).
- New 2-bit test cases run the op with `expert_weight_bits=2` and check the output stays within a loose tolerance of the float reference. The tolerance is widened for 2-bit because the format is intrinsically lossy.
- `lintrunner` is clean on the changed files.

Local CPU build/test execution was constrained by the dev environment (no prebuilt onnxruntime present and no incremental build dir to reuse), so the new tests are exercised via CI on this PR.
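For readers unfamiliar with the test helpers, a bits-parameterized row-wise asymmetric quantizer of the kind the Test Plan describes can be sketched like this. The names and exact rounding choices are illustrative assumptions, not copied from `test_qmoe_cpu.py`:

```python
import numpy as np

# Hypothetical sketch of a row-wise asymmetric quant helper generalized over
# bit width, in the spirit of the `bits` parameter added to the test helpers.
# Names and rounding details are illustrative, not taken from the PR.
def quantize_rowwise(w: np.ndarray, bits: int = 4):
    qmax = (1 << bits) - 1                      # 3, 15, or 255
    wmin = w.min(axis=1, keepdims=True)
    wmax = w.max(axis=1, keepdims=True)
    scale = (wmax - wmin) / qmax
    scale[scale == 0] = 1.0                     # guard constant rows
    zero_point = np.clip(np.round(-wmin / scale), 0, qmax)
    q = np.clip(np.round(w / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize_rowwise(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale
```

With `bits=2` every weight collapses onto one of four levels per row, which is why the 2-bit parity tolerance has to be looser than in the 4- and 8-bit cases.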
test_qmoe_cpu.pycontinue to pass (the helpers default tobits=4).expert_weight_bits=2, and check the output stays within a loose tolerance of the float reference. The tolerance is widened for 2-bit because the format is intrinsically lossy.lintrunneris clean on the changed files.Local CPU build/test execution was constrained by the dev environment (no prebuilt onnxruntime present and no incremental build dir to reuse), so the new tests are exercised via CI on this PR.