contrib: Add Qwen3-Coder-30B-A3B-Instruct (Qwen3MoE) model#67
Open
yahavb wants to merge 1 commit into aws-neuron:main
Conversation
Add NeuronX Distributed Inference support for Qwen/Qwen3-Coder-30B-A3B-Instruct.

Architecture: Qwen3MoeForCausalLM
- 48 decoder layers, 30B total params, ~3B active (MoE)
- GQA with QK-norm, 128 experts, top-8 softmax routing
- Validated: TP=32, BF16, seq_len=2048 on trn1.32xlarge
- Compilation: 8.6 min, inference: 2.0 tok/s
- Test prompt: fibonacci(n) -> correct Python implementation

Wraps the existing framework qwen3_moe implementation with a Qwen3CoderMoeInferenceConfig that adds from_pretrained support.
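The top-8 softmax routing mentioned above can be sketched in plain NumPy. This is an illustrative sketch only, not the NxDI implementation: the function name and shapes are assumptions, and `norm_topk_prob=False` is modeled by using the top-k softmax probabilities directly without renormalizing them.

```python
import numpy as np

def route_tokens(hidden, router_w, top_k=8):
    """Softmax-then-top-k routing sketch (hypothetical helper).

    hidden:   (num_tokens, hidden_size) token activations
    router_w: (hidden_size, num_experts) router weight
    Returns per-token expert indices and their softmax weights; with
    norm_topk_prob=False the top-k weights are NOT renormalized to sum to 1.
    """
    logits = hidden @ router_w                              # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)              # softmax over all experts
    top_idx = np.argsort(-probs, axis=-1)[:, :top_k]        # best experts first
    top_w = np.take_along_axis(probs, top_idx, axis=-1)
    return top_idx, top_w

# Toy dimensions for illustration; the real model routes over 128 experts.
rng = np.random.default_rng(0)
idx, w = route_tokens(rng.normal(size=(4, 16)), rng.normal(size=(16, 128)))
```

Because only 8 of 128 experts run per token, most expert parameters are idle on any given forward pass, which is how a 30B-parameter model has only ~3B active parameters.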
Description
Community contribution for serving Qwen3-Coder-30B-A3B-Instruct on trn2.48xlarge via NxD Inference. This model uses the existing
qwen3_moe architecture already in NxDI, so no custom modeling code is needed. The contribution provides a from_pretrained-compatible config wrapper, integration tests, and comprehensive documentation of compilation/inference results.

Model Information
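As a rough illustration of the from_pretrained-compatible config wrapper described above, the mapping might look like the following sketch. The field names come from the PR text, but the class internals are simplified stand-ins: the real NxDI class loads from a model path and wraps NxDI's config machinery, whereas here a plain dict stands in for the HF config.

```python
from dataclasses import dataclass

@dataclass
class Qwen3CoderMoeInferenceConfig:
    """Simplified sketch of the PR's config wrapper (not the real class)."""
    num_experts: int = 128
    num_experts_per_tok: int = 8
    moe_intermediate_size: int = 768
    norm_topk_prob: bool = False   # softmax routing, top-k weights not renormalized

    @classmethod
    def from_pretrained(cls, hf_config: dict) -> "Qwen3CoderMoeInferenceConfig":
        # Map HuggingFace config attributes onto the inference config;
        # the real implementation reads these from the checkpoint's config.json.
        return cls(
            num_experts=hf_config.get("num_experts", 128),
            num_experts_per_tok=hf_config.get("num_experts_per_tok", 8),
            moe_intermediate_size=hf_config.get("moe_intermediate_size", 768),
            norm_topk_prob=hf_config.get("norm_topk_prob", False),
        )

cfg = Qwen3CoderMoeInferenceConfig.from_pretrained({"num_experts": 128})
```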
Checklist
Required Components
- Accuracy Test (test/integration/test_model.py): fibonacci(n) prompt produces a correct Python implementation with fibonacci(10) = 55
- README.md with the following sections:
- Source Code (src/):
  - modeling_qwen3_coder_moe.py: Qwen3CoderMoeInferenceConfig (with from_pretrained, MoE config mapping, router dtype) + NeuronQwen3CoderMoeForCausalLM wrapper
  - __init__.py: Exports model and config classes
  - Uses the qwen3_moe framework implementation, no custom modeling code needed

Optional Components

- Unit tests: wraps the qwen3_moe implementation, so there is no custom modeling code to unit test

Folder Structure
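The accuracy test in the checklist checks that the model's generated fibonacci implementation actually computes fibonacci(10) = 55. A minimal sketch of that kind of check (the harness below is hypothetical; the string stands in for real model output):

```python
# Stand-in for the text the model generates when prompted for fibonacci(n).
generated = """
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)
"""

# Execute the generated code in an isolated namespace and assert correctness,
# mirroring the fibonacci(10) = 55 check described in the PR.
namespace = {}
exec(generated, namespace)
assert namespace["fibonacci"](10) == 55
```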
Testing
How did you test this change?
All testing performed on trn2.48xlarge with Neuron SDK 2.22 (neuronx-cc 2.22.12471, torch-neuronx 2.9.0.2.11), Python 3.12, Ubuntu 24.04.
1. Compilation (TP=32, BF16, seq_len=2048, batch_size=1)
2. Model Loading
3. Inference Validation
Prompt: fibonacci(n) (coding model, per porting guide)
Result: fibonacci(10) = 55

Compatibility
Tested with:
Additional Information
Key Details
- Reuses the qwen3_moe architecture in NxDI: no custom attention, MLP, or decoder layer code
- Qwen3CoderMoeInferenceConfig maps HF config attributes and provides from_pretrained support
- MoE settings: num_experts=128, num_experts_per_tok=8, moe_intermediate_size=768, softmax routing with norm_topk_prob=False, no shared experts
- Reuses convert_qwen3_moe_hf_to_neuron_state_dict (QKV fusion, expert weight fusion, QK-norm renaming, router renaming)

Known Issues
- Must use MoENeuronConfig (not NeuronConfig): the model accesses moe_ep_degree
- output_attentions and output_hidden_states must be set on the config (handled automatically by add_derived_config)

Related Issues
None.
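The QKV-fusion step of the state-dict conversion mentioned under Key Details can be illustrated with a small NumPy sketch. The key names and function below are illustrative assumptions, not the real convert_qwen3_moe_hf_to_neuron_state_dict implementation: the idea is just that the separate q/k/v projection weights are concatenated so one fused matmul computes all three projections.

```python
import numpy as np

def fuse_qkv(state_dict, layer):
    """Hypothetical sketch of the QKV-fusion step in a HF -> Neuron
    state-dict conversion (key names are illustrative)."""
    prefix = f"model.layers.{layer}.self_attn."
    q = state_dict.pop(prefix + "q_proj.weight")
    k = state_dict.pop(prefix + "k_proj.weight")
    v = state_dict.pop(prefix + "v_proj.weight")
    # Each weight is (out_features, hidden_size), so concatenate along rows.
    state_dict[prefix + "qkv_proj.weight"] = np.concatenate([q, k, v], axis=0)
    return state_dict

# Toy shapes: with GQA the k/v projections have fewer output rows than q.
hidden = 16
sd = {
    "model.layers.0.self_attn.q_proj.weight": np.ones((32, hidden)),
    "model.layers.0.self_attn.k_proj.weight": np.ones((8, hidden)),
    "model.layers.0.self_attn.v_proj.weight": np.ones((8, hidden)),
}
sd = fuse_qkv(sd, 0)
```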
By submitting this PR, I confirm that: