
Add Qwen3-Coder-480B-A35B-Instruct contrib: optimized configs for trn2.48xlarge #66

Open

jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/qwen3-coder-480b

Conversation

@jimburtoft

Description

Community contribution for serving Qwen3-Coder-480B-A35B-Instruct on trn2.48xlarge via vLLM/NxD Inference. This model uses the existing qwen3_moe architecture already in NxDI, so no custom modeling code is needed. The contribution provides optimized configurations, launch scripts, benchmarks, and comprehensive documentation of architectural constraints specific to this 480B-parameter variant.
Key architectural differences from the supported Qwen3-235B (8 KV heads vs 4, head_dim=192 vs 128, 160 experts vs 128, hidden_size=6144 vs 5120) required specific optimization work on SDK 2.28 to achieve full NKI kernel compatibility and determine HBM-safe operating points.
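To make the HBM impact of those head-count and head-dim differences concrete, here is a back-of-envelope sketch (my own arithmetic, not from this PR; it assumes a bf16 KV cache and compares a single layer, since total layer counts are not quoted above):

```python
# Rough KV-cache cost per token per layer, bf16 (2 bytes per element).
# Uses only the head counts and head_dim quoted above; this is an
# illustrative estimate, not a measured HBM figure.

def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int,
                                 dtype_bytes: int = 2) -> int:
    # One K vector plus one V vector per KV head.
    return 2 * num_kv_heads * head_dim * dtype_bytes

coder_480b = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=192)
qwen3_235b = kv_bytes_per_token_per_layer(num_kv_heads=4, head_dim=128)
print(coder_480b, qwen3_235b, coder_480b / qwen3_235b)  # 6144 2048 3.0
```

Per layer, the 480B variant's KV cache is 3x larger, which is one reason the HBM-safe operating points had to be re-derived rather than reused from the 235B configs.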
Model Information
Model Name: Qwen3-Coder-480B-A35B-Instruct
Model Architecture: Mixture-of-Experts (MoE) decoder-only transformer (480B total, 35B active per token, 160 experts, 96 attention heads, 8 KV heads, head_dim=192)
Purpose: Code generation and instruction following

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md (../contrib/CONTRIBUTING.md) for detailed guidelines.
Required Components

  • Accuracy Test (test/integration/test_model.py)
    • 8 integration tests validating generation quality (Fibonacci, math, translation), TTFT, throughput, and concurrent serving
    • Tests run against a live vLLM server via the OpenAI-compatible API
    • All tests pass on trn2.48xlarge with SDK 2.28
  • README.md with the following sections:
    • Usage Example: vLLM Quick Start with curl test command
    • Compatibility Matrix: Tested SDK versions and instance types
    • Example Checkpoints: Link to Qwen/Qwen3-Coder-480B-A35B-Instruct (https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)
    • Testing Instructions: pytest command and standalone execution
  • Source Code (src/)
    • qwen3_coder_vllm.sh -- vLLM launch script (throughput + long-context configs)
    • generation_qwen3_coder_demo.py -- NxDI direct usage example
    • bench_qwen3_coder.py -- Benchmark script (TTFT, throughput, concurrent, quality)
    • __init__.py -- Notes that this model uses the existing qwen3_moe architecture (no custom modeling code)

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • Not applicable: model uses existing qwen3_moe implementation, no custom modeling code to unit test

Folder Structure

Confirm your contribution follows this structure:
/contrib/models/Qwen3-Coder-480B-A35B-Instruct/
  README.md
  /configs
    throughput_optimized.json   # 8192 ctx, BS=16, auto-bucketing (recommended)
    long_context.json           # 16384 ctx, BS=8
  /src
    __init__.py
    qwen3_coder_vllm.sh
    generation_qwen3_coder_demo.py
    bench_qwen3_coder.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?
All configurations were tested on trn2.48xlarge (64 NeuronCores, LNC=2) with Neuron SDK 2.28 (Deep Learning AMI Neuron (Ubuntu 24.04) 20260227). Testing included:

  1. Progressive kernel enablement -- each QKV/attention/async kernel enabled one-by-one and validated
  2. Batch size sweeps at 8192 and 16384 sequence lengths to find HBM-safe operating points
  3. Context parallelism testing (cp_degree=8, cp_degree=16) to attempt longer contexts
  4. Auto-bucketing vs single-bucket comparison (3.3x throughput improvement)
  5. Generation quality validation with Fibonacci, math, and translation prompts at every config change
Test Results:

Throughput-optimized config (8192 context, BS=16, auto-bucketing):

| Concurrency | Aggregate TPS |
| ----------- | ------------- |
| 1           | 14.73 tok/s   |
| 2           | 28.14 tok/s   |
| 4           | 43.42 tok/s   |
| 8           | 73.23 tok/s   |

  • TTFT (short prompt): 0.85s
  • Peak single-request decode: 15.3 tok/s
  • Compile time: ~22 min (7 CTE + 7 TKG buckets)
  • Weight load time: ~10 min

Long-context config (16384 context, BS=8):

| Concurrency | Aggregate TPS |
| ----------- | ------------- |
| 1           | 8.37 tok/s    |
| 8           | 10.41 tok/s   |

  • TTFT: 18.47s
  • Max batch_size=8 (BS=12 OOMs at NEFF load, BS=16 OOMs at compile)
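As a quick sanity check on how the throughput-optimized config scales, the reported aggregate TPS numbers above can be reduced to a scaling-efficiency figure (my own arithmetic on the reported values, not an additional measurement):

```python
# Scaling efficiency relative to perfect linear scaling, computed from the
# aggregate TPS reported for the throughput-optimized config above.
reported_tps = {1: 14.73, 2: 28.14, 4: 43.42, 8: 73.23}
base = reported_tps[1]

for concurrency, tps in sorted(reported_tps.items()):
    efficiency = tps / (base * concurrency)  # 1.0 would be perfectly linear
    print(f"concurrency {concurrency}: {tps:6.2f} tok/s, "
          f"{efficiency:.0%} of linear")
```

At 8 concurrent requests the config still retains roughly 62% of linear scaling.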

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28 (neuronx-cc 2.22, neuronx-distributed-inference 0.7)
  • Instance Type(s): trn2.48xlarge (64 NeuronCores, LNC=2)
  • PyTorch Version: 2.9
  • Python Version: 3.12 (Ubuntu 24.04 DLAMI)
| Instance      | SDK 2.28                                |
| ------------- | --------------------------------------- |
| trn2.48xlarge | Validated                               |
| trn2.3xlarge  | Not enough NeuronCores (requires TP=64) |
| trn1 / Inf2   | Not tested                              |

Additional Information

Issues Resolved

  • QKV NKI kernel with 8 KV heads: failed on SDK 2.27 due to a hardcoded 4-KV-head assumption; works natively on SDK 2.28 in the pure TP=64 configuration.
  • Expert Parallelism selective loading: 160 experts with top_k=8 trigger selective loading for batch_size < 20. Workaround: use batch_size >= 20, or use pure TP (EP=1), which avoids the code path entirely.
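The batch_size < 20 boundary is consistent with a simple coverage argument (my own inference from the numbers quoted above, not documented NxDI behaviour): selective loading only helps while batch_size * top_k cannot cover every expert.

```python
# Hedged inference from the quoted numbers (160 experts, top_k=8, reported
# threshold batch_size < 20); not taken from NxDI's implementation.
num_experts, top_k = 160, 8
threshold = num_experts // top_k
print(threshold)  # 20 -> matches the reported batch_size < 20 boundary

for batch_size in (8, 16, 20, 32):
    mode = "selective loading" if batch_size < threshold else "full expert load"
    print(batch_size, mode)
```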
Known Limitations

  • flash_decoding_enabled must be false -- the kv_shared parallel group assertion fails with 8 KV heads in pure TP=64
  • cp_degree=16 causes an Internal Compiler Error -- head_dim=192 exceeds SBUF partition limits during CTE linking
  • cp_degree=8 produces garbage output for seq_len > 8192 -- likely an NxDI bug for head_dim=192 models
  • NKI QKV kernels are incompatible with attention_dp_degree > 1 -- shape mismatch in qkv.py
  • Max context is 8192 at BS=16 (or 16384 at BS=8) -- HBM-limited by hidden_size=6144
Key Optimization: Auto-Bucketing

Removing explicit context_encoding_buckets and token_generation_buckets lets NxDI auto-generate 7 buckets (128, 256, 512, 1024, 2048, 4096, 8192). Combined with removing on_device_sampling_config (vLLM samples on CPU), this achieved a 3.3x aggregate throughput improvement (22 -> 73 tok/s at 8 concurrent) and an 8.4x TTFT improvement (7.14s -> 0.85s for short prompts).
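For intuition, the bucket ladder can be reproduced with a tiny sketch (the 128 floor and the doubling rule are inferred from the seven values listed above, not taken from NxDI's source):

```python
def power_of_two_buckets(max_len: int, min_len: int = 128) -> list[int]:
    """Doubling bucket ladder; an inferred model of the observed buckets."""
    buckets, size = [], min_len
    while size <= max_len:
        buckets.append(size)
        size *= 2
    return buckets

print(power_of_two_buckets(8192))
# [128, 256, 512, 1024, 2048, 4096, 8192]
```

With a max context of 8192 this yields exactly the 7 buckets NxDI generated; each request is padded up to the nearest bucket instead of to the full context length, which is where the TTFT win for short prompts comes from.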
Related Issues

None.

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions
The model uses the existing qwen3_moe model type and is served via vllm.entrypoints.openai.api_server with VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference'. No custom vLLM registration is needed. See src/qwen3_coder_vllm.sh for the complete launch script.
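Once the server is up, a minimal client-side check can look like the following sketch (the port and served-model name are assumptions; see src/qwen3_coder_vllm.sh for the values the launch script actually uses, and the README's curl example for the shell equivalent):

```python
import json

BASE_URL = "http://localhost:8000/v1"          # assumed default vLLM port
MODEL = "Qwen/Qwen3-Coder-480B-A35B-Instruct"  # assumed served model name

def build_chat_payload(prompt: str, max_tokens: int = 256) -> dict:
    """Body for POST <BASE_URL>/chat/completions (OpenAI-compatible API)."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_payload("Write a Python function that returns fib(n).")
print(json.dumps(payload, indent=2))
# Send with e.g.: requests.post(f"{BASE_URL}/chat/completions", json=payload)
```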

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines (../contrib/CONTRIBUTING.md)
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Add Qwen3-Coder-480B-A35B-Instruct contrib: optimized configs for trn2.48xlarge

Adds community contrib for serving Qwen3-Coder-480B-A35B-Instruct on
trn2.48xlarge via vLLM/NxDI. The model uses the existing qwen3_moe
architecture but requires specific configuration due to architectural
differences from Qwen3-235B (8 KV heads, head_dim=192, 160 experts).

Includes:
- Throughput-optimized config (8192 ctx, BS=16, auto-bucketing): 73 tok/s @8 concurrent
- Long-context config (16384 ctx, BS=8): 10.4 tok/s @8 concurrent
- vLLM launch script, NxDI direct usage example, benchmark script
- Integration tests (vLLM API-based)
- Comprehensive known issues documentation (flash_decoding, CP, EP constraints)

Validated on SDK 2.28, trn2.48xlarge.
