Add Qwen3-Coder-480B-A35B-Instruct contrib: optimized configs for trn…#66
Open
jimburtoft wants to merge 1 commit into aws-neuron:main from
Conversation
…2.48xlarge

Adds a community contrib for serving Qwen3-Coder-480B-A35B-Instruct on trn2.48xlarge via vLLM/NxDI. The model uses the existing qwen3_moe architecture but requires specific configuration due to architectural differences from Qwen3-235B (8 KV heads, head_dim=192, 160 experts).

Includes:
- Throughput-optimized config (8192 ctx, BS=16, auto-bucketing): 73 tok/s @ 8 concurrent
- Long-context config (16384 ctx, BS=8): 10.4 tok/s @ 8 concurrent
- vLLM launch script, NxDI direct usage example, benchmark script
- Integration tests (vLLM API-based)
- Comprehensive known-issues documentation (flash_decoding, CP, EP constraints)

Validated on SDK 2.28, trn2.48xlarge.
Description
Community contribution for serving Qwen3-Coder-480B-A35B-Instruct on trn2.48xlarge via vLLM/NxD Inference. This model uses the existing qwen3_moe architecture already in NxDI, so no custom modeling code is needed. The contribution provides optimized configurations, launch scripts, benchmarks, and comprehensive documentation of architectural constraints specific to this 480B-parameter variant.
Key architectural differences from the supported Qwen3-235B (8 KV heads vs 4, head_dim=192 vs 128, 160 experts vs 128, hidden_size=6144 vs 5120) required specific optimization work on SDK 2.28 to achieve full NKI kernel compatibility and determine HBM-safe operating points.
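These head-count and head-dimension differences translate directly into KV-cache pressure, which is why HBM-safe operating points had to be re-derived. A back-of-envelope sketch (per layer, per token, bf16; layer counts are not stated in this PR, so only the per-layer cost is compared):

```python
def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int,
                                 dtype_bytes: int = 2) -> int:
    # K and V each store (num_kv_heads, head_dim) values per token per layer;
    # dtype_bytes=2 assumes bf16.
    return 2 * num_kv_heads * head_dim * dtype_bytes

# Head counts/dims taken from the PR description.
coder_480b = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=192)  # 6144 bytes
qwen3_235b = kv_bytes_per_token_per_layer(num_kv_heads=4, head_dim=128)  # 2048 bytes

print(coder_480b, qwen3_235b, coder_480b / qwen3_235b)  # 6144 2048 3.0
```

So each cached token costs roughly 3x more per layer than on Qwen3-235B, before even accounting for the different layer counts.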
Model Information
Model Name: Qwen3-Coder-480B-A35B-Instruct
Model Architecture: Mixture-of-Experts (MoE) decoder-only transformer (480B total, 35B active per token, 160 experts, 96 attention heads, 8 KV heads, head_dim=192)
Purpose: Code generation and instruction following
Checklist
Please ensure your PR includes the following items. Refer to [contrib/CONTRIBUTING.md](../contrib/CONTRIBUTING.md) for detailed guidelines.
Required Components
Optional Components
Folder Structure
Confirm your contribution follows this structure:
/contrib/models/Qwen3-Coder-480B-A35B-Instruct/
    README.md
    /configs
        throughput_optimized.json   # 8192 ctx, BS=16, auto-bucketing (recommended)
        long_context.json           # 16384 ctx, BS=8
    /src
        __init__.py
        qwen3_coder_vllm.sh
        generation_qwen3_coder_demo.py
        bench_qwen3_coder.py
    /test
        __init__.py
        /unit
            __init__.py
        /integration
            __init__.py
            test_model.py
Testing
How did you test this change?
All configurations were tested on trn2.48xlarge (64 NeuronCores, LNC=2) with Neuron SDK 2.28 (Deep Learning AMI Neuron (Ubuntu 24.04) 20260227). Testing included:
Test Results:
Throughput-optimized config (8192 context, BS=16, auto-bucketing):
| Concurrency | Aggregate TPS |
|------------:|--------------:|
| 1 | 14.73 tok/s |
| 2 | 28.14 tok/s |
| 4 | 43.42 tok/s |
| 8 | 73.23 tok/s |
Long-context config (16384, BS=8):
| Concurrency | Aggregate TPS |
|------------:|--------------:|
| 1 | 8.37 tok/s |
| 8 | 10.41 tok/s |
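The aggregate TPS figures above are total tokens generated across all concurrent requests divided by wall-clock time. A minimal sketch of that metric (the actual logic lives in bench_qwen3_coder.py; the request count and timing here are synthetic, chosen only to land near the 8-concurrent figure reported above):

```python
def aggregate_tps(total_generated_tokens: int, wall_clock_seconds: float) -> float:
    # Aggregate throughput: tokens produced by every concurrent request
    # combined, divided by total wall-clock time of the run.
    return total_generated_tokens / wall_clock_seconds

# Synthetic example: 8 concurrent requests x 256 new tokens in 27.97 s wall time.
print(round(aggregate_tps(8 * 256, 27.97), 2))  # 73.22
```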
Compatibility
Tested with:
| Instance | SDK 2.28 |
|---|---|
| trn2.48xlarge | Validated |
| trn2.3xlarge | Not enough NeuronCores (requires TP=64) |
| trn1 / Inf2 | Not tested |
Additional Information
Issues Resolved
Known Limitations
Key Optimization: Auto-Bucketing
Removing explicit context_encoding_buckets and token_generation_buckets lets NxDI auto-generate 7 buckets (128, 256, 512, 1024, 2048, 4096, 8192). Combined with removing on_device_sampling_config (vLLM samples on CPU), this achieved 3.3x aggregate throughput improvement (22 -> 73 tok/s at 8 concurrent) and 8.4x TTFT improvement (7.14s -> 0.85s for short prompts).
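The seven bucket sizes NxDI generated follow a power-of-two progression up to the maximum context length. A small sketch that reproduces the reported values (this merely mirrors the bucket list quoted above; the actual derivation is internal to NxDI and may differ):

```python
def powers_of_two_buckets(max_len: int, min_len: int = 128) -> list[int]:
    # Power-of-two sequence from min_len up to max_len, matching the
    # 7 auto-generated buckets reported for the 8192-token config.
    buckets = []
    size = min_len
    while size <= max_len:
        buckets.append(size)
        size *= 2
    return buckets

print(powers_of_two_buckets(8192))  # [128, 256, 512, 1024, 2048, 4096, 8192]
```

Smaller buckets let short prompts compile to short graphs instead of always padding to the full context, which is where the TTFT improvement comes from.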
Related Issues
None.
vLLM Integration
The model uses the existing qwen3_moe model type and is served via vllm.entrypoints.openai.api_server with VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference'. No custom vLLM registration is needed. See src/qwen3_coder_vllm.sh for the complete launch script.
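Once the server is up, any OpenAI-style client can talk to it. A hedged sketch of building a completions request (the endpoint and model name are assumptions: vLLM's OpenAI-compatible server defaults to port 8000, and the served model name depends on how qwen3_coder_vllm.sh launches it; only the payload is constructed here, the actual POST is left commented out):

```python
import json
import urllib.request

# Hypothetical endpoint; adjust host/port to match qwen3_coder_vllm.sh.
ENDPOINT = "http://localhost:8000/v1/completions"

payload = {
    "model": "Qwen/Qwen3-Coder-480B-A35B-Instruct",  # assumed served model name
    "prompt": "Write a Python function that reverses a string.",
    "max_tokens": 128,
    "temperature": 0.2,
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# To actually send the request against a running server:
# response = urllib.request.urlopen(request)
print(json.loads(request.data)["model"])
```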
By submitting this PR, I confirm that: