
contrib: Add Qwen3-Coder-30B-A3B-Instruct (Qwen3MoE) model#67

Open
yahavb wants to merge 1 commit into aws-neuron:main from yahavb:contrib/qwen3-coder-30b-a3b-instruct

Conversation

@yahavb yahavb commented Mar 10, 2026

Description

Community contribution for serving Qwen3-Coder-30B-A3B-Instruct on trn2.48xlarge via NxD Inference. This model uses the existing qwen3_moe architecture already in NxDI, so no custom modeling code is needed. The contribution provides a from_pretrained-compatible config wrapper, integration tests, and comprehensive documentation of compilation/inference results.

Model Information

  • Model Name: Qwen3-Coder-30B-A3B-Instruct
  • Model Architecture: Mixture-of-Experts (MoE) decoder-only transformer (30B total, 3B active per token, 128 experts, top-8 softmax routing, 48 layers, 32 query heads, 4 KV heads, QK-norm RMSNorm, RoPE, SwiGLU)
  • Purpose: Code generation and instruction following
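The top-8 softmax routing described above (combined with norm_topk_prob=False, noted later in this PR) can be sketched in plain Python. This is an illustrative sketch only, not the framework's routing kernel; the function name and shapes here are assumptions:

```python
import math

def route_token(logits, top_k=8):
    """Softmax over ALL expert logits, then keep the top-k experts.
    With norm_topk_prob=False the selected weights are NOT renormalized,
    so they sum to less than 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]

# 128 toy expert logits for one token; the 8 largest win.
selected = route_token([0.05 * i for i in range(128)])
print([i for i, _ in selected])  # the 8 experts with the largest logits
```

Because only 8 of 128 expert MLPs fire per token, roughly 3B of the 30B parameters are active per forward step.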

Checklist

Required Components

  • Accuracy Test (test/integration/test_model.py)

    • Smoke test: model loads and generates tokens
    • Generation test: fibonacci(n) prompt produces correct Python implementation with fibonacci(10) = 55
    • All tests pass on trn2.48xlarge with SDK 2.22
  • README.md with the following sections:

    • Architecture Details: comparison table vs Qwen3-MoE-15B-A2B
    • Hardware Requirements
    • Quick Start: compile + load + generate code examples
    • Benchmark Results: compilation metrics, inference throughput, generation quality with actual output
    • Weight Conversion Details
    • Known Issues and Limitations
    • Compatibility Matrix
    • Testing Instructions
    • Example Checkpoints: Qwen/Qwen3-Coder-30B-A3B-Instruct
  • Source Code (src/)

    • modeling_qwen3_coder_moe.py — Qwen3CoderMoeInferenceConfig (with from_pretrained, MoE config mapping, router dtype) + NeuronQwen3CoderMoeForCausalLM wrapper
    • __init__.py — Exports model and config classes
    • Note: Uses existing qwen3_moe framework implementation, no custom modeling code needed

Optional Components

  • Unit Tests — Not applicable: model uses existing qwen3_moe implementation, no custom modeling code to unit test

Folder Structure

contrib/models/Qwen3-Coder-30B-A3B-Instruct/
├── README.md
├── test_model.py
├── src/
│   ├── __init__.py
│   └── modeling_qwen3_coder_moe.py
└── test/
    ├── __init__.py
    ├── integration/
    │   ├── __init__.py
    │   └── test_model.py
    └── unit/
        └── __init__.py

Testing

How did you test this change?

All testing performed on trn2.48xlarge with Neuron SDK 2.22 (neuronx-cc 2.22.12471, torch-neuronx 2.9.0.2.11), Python 3.12, Ubuntu 24.04.

1. Compilation (TP=32, BF16, seq_len=2048, batch_size=1)

  • CTE HLO generation: 54.4s
  • TKG HLO generation: 0.8s
  • TKG NEFF compilation: 118s — Compiler status PASS
  • CTE NEFF compilation: 209s — Compiler status PASS
  • Weight sharding: 77.7s → 32 sharded checkpoint files
  • Total compilation time: 8.6 minutes
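The itemized stage timings above sum to about 7.7 minutes; the remaining ~0.9 minutes of the reported 8.6-minute total is framework and startup overhead not itemized per stage. A quick sanity check:

```python
# Per-stage compilation timings (seconds) reported in this PR.
stages = {
    "CTE HLO generation": 54.4,
    "TKG HLO generation": 0.8,
    "TKG NEFF compilation": 118.0,
    "CTE NEFF compilation": 209.0,
    "weight sharding": 77.7,
}
itemized = sum(stages.values())        # 459.9 s
print(round(itemized / 60, 1))         # ~7.7 of the reported 8.6 minutes
```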

2. Model Loading

  • Pre-sharded checkpoint loading: 14.0s (32 ranks)
  • Warmup: 1.2s

3. Inference Validation

Prompt: fibonacci(n) (coding model, per porting guide)

Response: = fibonacci(n-1) + fibonacci(n-2) for n > 1
fibonacci(0) = 0
fibonacci(1) = 1

def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)

# Test the function
print(fibonacci(10))  # Output: 55

# Alternative implementation using memoization for better performance
def fibonacci
  • ✅ Correct recursive formula, base cases, working implementation
  • ✅ Correct test output: fibonacci(10) = 55
  • 100 tokens generated, 2.0 tok/s, greedy decoding
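A minimal, framework-independent way to assert the generation-quality claim above is to execute the completed function and check fibonacci(10) == 55. This is an illustrative sketch of what such a check looks like; the actual test code lives in test/integration/test_model.py:

```python
# The completed function extracted from the model response above.
generated_code = """
def fibonacci(n):
    if n <= 1:
        return n
    else:
        return fibonacci(n-1) + fibonacci(n-2)
"""
namespace = {}
exec(generated_code, namespace)
result = namespace["fibonacci"](10)
print(result)  # 55, matching the model's own "# Output: 55" comment
```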

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.22 (neuronx-cc 2.22.12471, torch-neuronx 2.9.0.2.11)
  • Instance Type(s): trn2.48xlarge
  • PyTorch Version: 2.9.1
  • Python Version: 3.12 (Ubuntu 24.04)
| Instance / SDK | 2.22+ | 2.21 and earlier |
|---|---|---|
| trn2.48xlarge | ✅ Validated (TP=32) | Not tested |
| trn1.32xlarge | Should work (TP=32) | Not tested |
| Inf2 | Not tested | Not tested |

Additional Information

Key Details

  • Uses existing qwen3_moe architecture in NxDI — no custom attention, MLP, or decoder layer code
  • Adds Qwen3CoderMoeInferenceConfig that maps HF config attributes and provides from_pretrained support
  • MoE config: num_experts=128, num_experts_per_tok=8, moe_intermediate_size=768, softmax routing with norm_topk_prob=False, no shared experts
  • Router dtype forced to float32 for accuracy
  • Weight conversion handled by framework's convert_qwen3_moe_hf_to_neuron_state_dict (QKV fusion, expert weight fusion, QK-norm renaming, router renaming)
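The HF-to-Neuron config mapping described above can be sketched as follows. Field and function names here are assumptions for illustration only; the real wrapper is Qwen3CoderMoeInferenceConfig in src/modeling_qwen3_coder_moe.py:

```python
from dataclasses import dataclass

# Illustrative sketch of the MoE config mapping, NOT the actual
# Qwen3CoderMoeInferenceConfig API.
@dataclass
class MoeSettings:
    num_experts: int
    num_experts_per_tok: int
    moe_intermediate_size: int
    norm_topk_prob: bool
    router_dtype: str

def map_hf_moe_config(hf_config: dict) -> MoeSettings:
    return MoeSettings(
        num_experts=hf_config["num_experts"],
        num_experts_per_tok=hf_config["num_experts_per_tok"],
        moe_intermediate_size=hf_config["moe_intermediate_size"],
        norm_topk_prob=hf_config.get("norm_topk_prob", False),
        router_dtype="float32",  # forced to fp32 for routing accuracy, per this PR
    )

settings = map_hf_moe_config({
    "num_experts": 128,
    "num_experts_per_tok": 8,
    "moe_intermediate_size": 768,
})
```

Forcing the router to float32 while the rest of the model runs in BF16 is a common MoE practice: small numerical errors in router logits can flip the top-k selection entirely, which is far more damaging than equivalent noise inside an expert MLP.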

Known Issues

  • Must use MoENeuronConfig (not NeuronConfig) — model accesses moe_ep_degree
  • output_attentions and output_hidden_states must be set on config (handled automatically by add_derived_config)

Related Issues

None.


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Add NeuronX Distributed Inference support for Qwen/Qwen3-Coder-30B-A3B-Instruct.

Architecture: Qwen3MoeForCausalLM
- 48 decoder layers, 30B total params, ~3B active (MoE)
- GQA with QK-norm, 128 experts, top-8 softmax routing
- Validated: TP=32, BF16, seq_len=2048 on trn2.48xlarge
- Compilation: 8.6 min, inference: 2.0 tok/s
- Test prompt: fibonacci(n) -> correct Python implementation

Wraps the existing framework qwen3_moe implementation with a
Qwen3CoderMoeInferenceConfig that adds from_pretrained support.
yahavb force-pushed the contrib/qwen3-coder-30b-a3b-instruct branch from 9e60152 to de8fa0a on March 11, 2026.