
contrib: Add Qwen3-Coder-30B-A3B-Instruct (Qwen3MoE) model#67

Open
yahavb wants to merge 1 commit into aws-neuron:main from yahavb:contrib/qwen3-coder-30b-a3b-instruct

Conversation

@yahavb yahavb commented Mar 10, 2026

Description

Community contribution for serving Qwen3-Coder-30B-A3B-Instruct on trn2.48xlarge via NxD Inference. This model uses the existing qwen3_moe architecture already in NxDI, so no custom modeling code is needed. The contribution provides a from_pretrained-compatible config wrapper, integration tests, and comprehensive documentation of compilation/inference results.

Model Information

  • Model Name: Qwen3-Coder-30B-A3B-Instruct
  • Model Architecture: Mixture-of-Experts (MoE) decoder-only transformer (30B total, 3B active per token, 128 experts, top-8 softmax routing, 48 layers, 32 query heads, 4 KV heads, QK-norm RMSNorm, RoPE, SwiGLU)
  • Purpose: Code generation and instruction following
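The top-8 softmax routing described above (combined with norm_topk_prob=False, noted later in this PR) can be sketched in plain Python. This is an illustrative sketch only, not the framework's routing kernel; the function name and shapes here are assumptions:

```python
import math

def route_token(logits, top_k=8):
    """Softmax over ALL expert logits, then keep the top-k experts.
    With norm_topk_prob=False the selected weights are NOT renormalized,
    so they sum to less than 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]

# 128 toy expert logits for one token; the 8 largest win.
selected = route_token([0.05 * i for i in range(128)])
print([i for i, _ in selected])  # the 8 experts with the largest logits
```

Because only 8 of 128 expert MLPs fire per token, roughly 3B of the 30B parameters are active per forward step.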

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • Smoke test: model loads and generates tokens
    • Generation test: fibonacci(n) prompt produces correct Python implementation with fibonacci(10) = 55
    • All tests pass on trn2.48xlarge with SDK 2.22
  • README.md with the following sections:

    • Architecture Details: comparison table vs Qwen3-MoE-15B-A2B
    • Hardware Requirements
    • Quick Start: compile + load + generate code examples
    • Benchmark Results: compilation metrics, inference throughput, generation quality with actual output
    • Weight Conversion Details
    • Known Issues and Limitations
    • Compatibility Matrix
    • Testing Instructions
    • Example Checkpoints: Qwen/Qwen3-Coder-30B-A3B-Instruct
  • Source Code (src/)

    • modeling_qwen3_coder_moe.py — Qwen3CoderMoeInferenceConfig (with from_pretrained, MoE config mapping, router dtype) + NeuronQwen3CoderMoeForCausalLM wrapper
    • __init__.py — Exports model and config classes
    • Note: Uses existing qwen3_moe framework implementation, no custom modeling code needed

Optional Components

  • Unit Tests — Not applicable: model uses existing qwen3_moe implementation, no custom modeling code to unit test

Folder Structure

contrib/models/Qwen3-Coder-30B-A3B-Instruct/
├── README.md
├── test_model.py
├── src/
│   ├── __init__.py
│   └── modeling_qwen3_coder_moe.py
└── test/
    ├── __init__.py
    ├── integration/
    │   ├── __init__.py
    │   └── test_model.py
    └── unit/
        └── __init__.py

Testing

How did you test this change?

All testing performed on trn2.48xlarge with Neuron SDK 2.22 (neuronx-cc 2.22.12471, torch-neuronx 2.9.0.2.11), Python 3.12, Ubuntu 24.04.

1. Compilation (TP=32, BF16, seq_len=2048, batch_size=1)

  • CTE HLO generation: 54.4s
  • TKG HLO generation: 0.8s
  • TKG NEFF compilation: 118s — Compiler status PASS
  • CTE NEFF compilation: 209s — Compiler status PASS
  • Weight sharding: 77.7s → 32 sharded checkpoint files
  • Total compilation time: 8.6 minutes
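The itemized stage timings above sum to about 7.7 minutes; the remaining ~0.9 minutes of the reported 8.6-minute total is framework and startup overhead not itemized per stage. A quick sanity check:

```python
# Per-stage compilation timings (seconds) reported in this PR.
stages = {
    "CTE HLO generation": 54.4,
    "TKG HLO generation": 0.8,
    "TKG NEFF compilation": 118.0,
    "CTE NEFF compilation": 209.0,
    "weight sharding": 77.7,
}
itemized = sum(stages.values())        # 459.9 s
print(round(itemized / 60, 1))         # ~7.7 of the reported 8.6 minutes
```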

2. Model Loading

  • Pre-sharded checkpoint loading: 14.0s (32 ranks)
  • Warmup: 1.2s

3. Inference Validation

Prompt: fibonacci(n) (coding model, per porting guide)

Response: = fibonacci(n-1) + fibonacci(n-2) for n > 1
fibonacci(0) = 0
fibonacci(1) = 1

def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Test the function
print(fibonacci(10))  # Output: 55

# Alternative implementation using memoization for better performance
def fibonacci
  • ✅ Correct recursive formula, base cases, working implementation
  • ✅ Correct test output: fibonacci(10) = 55
  • 100 tokens generated, 2.0 tok/s, greedy decoding
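A minimal, framework-independent way to assert the generation-quality claim above is to execute the completed function and check fibonacci(10) == 55. This is an illustrative sketch of what such a check looks like; the actual test code lives in test/integration/test_model.py:

```python
# The completed function extracted from the model response above.
generated_code = """
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
"""
namespace = {}
exec(generated_code, namespace)
result = namespace["fibonacci"](10)
print(result)  # 55, matching the model's own "# Output: 55" comment
```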

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22 (neuronx-cc 2.22.12471, torch-neuronx 2.9.0.2.11)
  • Instance Type(s): trn2.48xlarge
  • PyTorch Version: 2.9.1
  • Python Version: 3.12 (Ubuntu 24.04)
| Instance / SDK | 2.22+ | 2.21 and earlier |
|---|---|---|
| trn2.48xlarge | ✅ Validated (TP=32) | Not tested |
| trn1.32xlarge | Should work (TP=32) | Not tested |
| Inf2 | Not tested | Not tested |

Additional Information

Key Details

  • Uses existing qwen3_moe architecture in NxDI — no custom attention, MLP, or decoder layer code
  • Adds Qwen3CoderMoeInferenceConfig that maps HF config attributes and provides from_pretrained support
  • MoE config: num_experts=128, num_experts_per_tok=8, moe_intermediate_size=768, softmax routing with norm_topk_prob=False, no shared experts
  • Router dtype forced to float32 for accuracy
  • Weight conversion handled by framework's convert_qwen3_moe_hf_to_neuron_state_dict (QKV fusion, expert weight fusion, QK-norm renaming, router renaming)
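The HF-to-Neuron config mapping described above can be sketched as follows. Field and function names here are assumptions for illustration only; the real wrapper is Qwen3CoderMoeInferenceConfig in src/modeling_qwen3_coder_moe.py:

```python
from dataclasses import dataclass

# Illustrative sketch of the MoE config mapping, NOT the actual
# Qwen3CoderMoeInferenceConfig API.
@dataclass
class MoeSettings:
    num_experts: int
    num_experts_per_tok: int
    moe_intermediate_size: int
    norm_topk_prob: bool
    router_dtype: str

def map_hf_moe_config(hf_config: dict) -> MoeSettings:
    return MoeSettings(
        num_experts=hf_config["num_experts"],
        num_experts_per_tok=hf_config["num_experts_per_tok"],
        moe_intermediate_size=hf_config["moe_intermediate_size"],
        norm_topk_prob=hf_config.get("norm_topk_prob", False),
        router_dtype="float32",  # forced to fp32 for routing accuracy, per this PR
    )

settings = map_hf_moe_config({
    "num_experts": 128,
    "num_experts_per_tok": 8,
    "moe_intermediate_size": 768,
})
```

Forcing the router to float32 while the rest of the model runs in BF16 is a common MoE practice: small numerical errors in router logits can flip the top-k selection entirely, which is far more damaging than equivalent noise inside an expert MLP.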

Known Issues

  • Must use MoENeuronConfig (not NeuronConfig) — model accesses moe_ep_degree
  • output_attentions and output_hidden_states must be set on config (handled automatically by add_derived_config)

Related Issues

None.


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Add NeuronX Distributed Inference support for Qwen/Qwen3-Coder-30B-A3B-Instruct.

Architecture: Qwen3MoeForCausalLM
- 48 decoder layers, 30B total params, ~3B active (MoE)
- GQA with QK-norm, 128 experts, top-8 softmax routing
- Validated: TP=32, BF16, seq_len=2048 on trn2.48xlarge
- Compilation: 8.6 min, inference: 2.0 tok/s
- Test prompt: fibonacci(n) -> correct Python implementation

Wraps the existing framework qwen3_moe implementation with a
Qwen3CoderMoeInferenceConfig that adds from_pretrained support.
yahavb force-pushed the contrib/qwen3-coder-30b-a3b-instruct branch from 9e60152 to de8fa0a on March 11, 2026.