feat(npu): add NPU training support for Qwen3-4B and Qwen3.5-9B in full async and colocate modes by hbamboo · Pull Request #52 · redai-infra/Relax

hbamboo · 2026-06-23T12:18:35Z

What

This PR introduces full Ascend NPU support for the Relax training pipeline, covering the Docker image build, device configuration, Megatron runtime
adaptation, upstream dependency patches, and colocated training launch scripts.

Why

To enable Relax RL post-training on Ascend NPU hardware (e.g., Huawei 910B). This allows users to run GRPO, DAPO, and other algorithms on NPU clusters
with the same workflow as GPU, lowering the barrier for domestic hardware adoption.

How

NPU Patch Stack (feat(npu): add Megatron patch stack)
- Patch Megatron-LM for NPU-compatible checkpointing, Transformer Engine, MoE, MTP, optimizer offload, and distributed runtime behavior (940+
lines).
- Patch MindSpeed-Bridge for Qwen3-VL, Gated Delta Net, and model conversion compatibility.
- Patch torch-memory-saver for colocated NPU training with TMS preload mode.
Device Preservation (feat(npu): preserve Ascend devices)
- Add RAY_EXPERIMENTAL_NOSET_ASCEND_VISIBLE_DEVICES=1 flag to prevent Ray from overwriting Ascend visible devices.
- Remove local NPU allocator override from the Ray runtime environment (scripts/entrypoint/local-npu.sh).
Megatron Runtime Adaptation (fix(npu): adapt Megatron runtime)
- Reload and destroy process groups directly during offloaded save flows instead of fully waking/sleeping the training model.
- Preserve memory state after weight updates by skipping cache clearing during memory logging.
- Propagate gradient_accumulation_fusion into the Megatron model provider.
- Use NPU host memory cache cleanup API (torch_npu.npu.reset_max_memory_reserved) when on NPU devices.
Colocated Training Scripts (feat(npu): add 4x colocate scripts)
- Add run-qwen3-4B-4xnpu-colocate.sh for Qwen3-4B on 4x NPU.
- Add run-qwen35-9B-4xnpu-colocate.sh for Qwen3.5-9B on 4x NPU.
- Configure Ascend, HCCL, and TMS environment variables for local NPU training.
Docker Image Update (fix(npu): update Ascend Dockerfile)
- Switch to Ascend CANN 8.5.1 Python 3.11 base image.
- Build PyTorch 2.9.0 NPU stack (torch, torch_npu, triton-ascend).
- Install updated Megatron-LM, MindSpeed, sglang, and sgl-kernel-npu.
- Apply NPU patch stack during image build and clean caches to reduce image size.

Testing

pre-commit run --all-files passes
Tests pass (pytest tests/)
New tests added (if applicable)
Documentation updated (if applicable)

Type of Change

New feature (non-breaking change that adds functionality)
Bug fix (non-breaking change that fixes an issue)
CI/CD or build changes

# ⭐ Feature

## Add NPU patch stack for training dependencies

- Patch Megatron-LM for NPU-compatible checkpointing, Transformer Engine, MoE, MTP, optimizer offload, and distributed runtime behavior
- Patch MindSpeed-Bridge for Qwen3-VL, Gated Delta Net, and model conversion compatibility
- Patch torch-memory-saver to support colocated NPU training with TMS preload mode

Co-Authored-By: Claude <noreply@anthropic.com>

# ⭐ Feature

## Preserve Ascend visibility in Ray jobs

- Add the Ray experimental flag that prevents overriding Ascend visible devices
- Remove the local NPU allocator override from the Ray runtime environment

Co-Authored-By: Claude <noreply@anthropic.com>
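
As a rough illustration of the device-preservation change, the sketch below builds a Ray `runtime_env` dictionary that carries the flag. The flag name is copied verbatim from the PR description and has not been checked against Ray here. `build_runtime_env` and its `extra_env` parameter are hypothetical helpers, not part of this PR; only the `env_vars` key is standard `runtime_env` structure.

```python
from typing import Dict, Optional

# Flag name copied from the PR description; treat it as an assumption.
NOSET_FLAG = "RAY_EXPERIMENTAL_NOSET_ASCEND_VISIBLE_DEVICES"


def build_runtime_env(extra_env: Optional[Dict[str, str]] = None) -> Dict[str, Dict[str, str]]:
    """Return a runtime_env dict whose env_vars include the no-set flag (hypothetical helper)."""
    env_vars = {NOSET_FLAG: "1"}
    # Caller-supplied variables are merged in, but they cannot switch the flag off.
    for key, value in (extra_env or {}).items():
        if key != NOSET_FLAG:
            env_vars[key] = str(value)
    return {"env_vars": env_vars}
```

A dictionary of this shape could then be passed as the `runtime_env` argument of `ray.init`, so that every worker inherits the flag instead of having its Ascend device visibility rewritten.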

# 🐛 Bug Fix

## Adapt Megatron runtime for NPU training

- Reload and destroy process groups directly during offloaded save flows instead of fully waking or sleeping the training model
- Preserve memory state after weight updates by avoiding cache clearing during memory logging
- Propagate gradient_accumulation_fusion into the Megatron model provider when available
- Use the NPU host memory cache cleanup API when running on NPU devices

Co-Authored-By: Claude <noreply@anthropic.com>
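
The last item above amounts to choosing a device-specific API at runtime. The following is a minimal sketch of one way to do that, not the code in this PR: `resolve_device_api` is a hypothetical helper, and it assumes that `torch_npu`, once imported, exposes a `torch.npu` namespace with an `is_available()` function. Stand-in objects are used so the sketch runs without `torch` or `torch_npu` installed.

```python
from types import SimpleNamespace
from typing import Any, Callable


def resolve_device_api(torch_module: Any, name: str) -> Callable[..., Any]:
    """Return `name` from torch.npu when an NPU is available, else from torch.cuda."""
    npu = getattr(torch_module, "npu", None)
    if npu is not None and npu.is_available() and hasattr(npu, name):
        return getattr(npu, name)
    return getattr(torch_module.cuda, name)


# Stand-in for a torch module with an NPU backend registered (illustration only).
fake_torch = SimpleNamespace(
    npu=SimpleNamespace(is_available=lambda: True, empty_cache=lambda: "npu"),
    cuda=SimpleNamespace(is_available=lambda: False, empty_cache=lambda: "cuda"),
)

cleanup = resolve_device_api(fake_torch, "empty_cache")
```

Resolving the function by name keeps the call sites identical on GPU and NPU, which matches the stated goal of running the same workflow on both.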

# ⭐ Feature

## Add 4x NPU colocated training scripts

- Add a Qwen3-4B colocated training launch script for 4x NPU setups
- Add a Qwen3.5-9B colocated training launch script for 4x NPU setups
- Configure Ascend, HCCL, and TMS environment variables for local NPU training
- Submit colocated training jobs through Ray with externally configurable model and data paths

Co-Authored-By: Claude <noreply@anthropic.com>

# 🐛 Bug Fix

## Update the Ascend NPU training image

- Switch the NPU image to the Ascend CANN 8.5.1 Python 3.11 base image
- Build and install the PyTorch 2.9.0 NPU stack, including torch_npu and triton-ascend
- Install updated Megatron-LM, MindSpeed, MindSpeed-Bridge, Megatron-Bridge, sglang, and sgl-kernel-npu versions
- Apply the NPU patch stack during image build and copy bridge integrations into Megatron-LM
- Build sgl-kernel-npu kernels and memory-saver components for colocated NPU training
- Clean package and temporary caches to reduce image size

Co-Authored-By: Claude <noreply@anthropic.com>

Update the Markdown documentation for the newly merged PR

Signed-off-by: dabuliu123 <270334047@qq.com>

Signed-off-by: dabuliu123 <270334047@qq.com>

# 🎨 Style

## Fix missing EOF newline in env.yaml

- Add trailing newline at end of configs/env.yaml to comply with POSIX standards

# 🎨 Style

## Fix mdformat table alignment in NPU training doc

- Let mdformat auto-fix column widths to match content length

feat(scripts): add NPU training scripts and rename colocate scripts

# ⭐ Feature

## Add 8x NPU async training script for Qwen3.5-9B

- Add run-qwen35-9B-8xgpu-async-npu.sh for 8-card NPU fully async training
- Configure --fully-async with --max-staleness 2 and --num-iters-per-train-update 32
- Set --rollout-num-gpus-per-engine 2, --micro-batch-size 2 to avoid OOM
- Add --log-passrate and --skip-eval-before-train in EVAL_ARGS
- Enable --use-tensorboard and --use-clearml in WANDB_ARGS
- Set extra env: SGLANG_SET_CPU_AFFINITY, STREAMS_PER_DEVICE, HCCL_BUFFSIZE, HCCL_OP_EXPANSION_MODE

---

# ♻️ Refactor

## Rename colocate scripts with -npu suffix for clarity

- Rename run-qwen3-4B-4xnpu-colocate.sh -> run-qwen3-4B-4xnpu-colocate-npu.sh
- Rename run-qwen35-9B-4xnpu-colocate.sh -> run-qwen35-9B-4xnpu-colocate-npu.sh

---

# ⚡ Performance

## Tune training hyperparameters

- Qwen3-4B colocate: increase --global-batch-size from 128 to 256
- Qwen3.5-9B colocate: increase --rollout-temperature from 0.8 to 1
- Both colocate scripts: add --use-tensorboard in WANDB_ARGS

refactor(scripts): fix NPU script naming convention

# ♻️ Refactor

## Fix NPU training script naming

- Rename run-qwen3-4B-4xnpu-colocate-npu.sh -> run-qwen3-4B-4xnpu-colocate.sh (drop redundant -npu suffix)
- Rename run-qwen35-9B-4xnpu-colocate-npu.sh -> run-qwen35-9B-4xnpu-colocate.sh (drop redundant -npu suffix)
- Rename run-qwen35-9B-8xgpu-async-npu.sh -> run-qwen35-9B-8xnpu-async.sh (8xgpu -> 8xnpu)

refactor(scripts): update 8xnpu-async script for 8-card NPU setup

# ♻️ Refactor

## Standardize run-qwen35-9B-8xnpu-async.sh for 8-card NPU

- Update header comment to Qwen3.5-9B 8xGPU and fix usage line
- Remove 16-card env vars (SGLANG_SET_CPU_AFFINITY, STREAMS_PER_DEVICE, HCCL_BUFFSIZE, etc.)
- Set ASCEND_RT_VISIBLE_DEVICES to 8 cards (0-7) and add ASCEND_COREDUMP_SIGNAL
- Standardize port ranges and EXP_DIR/MODEL_DIR/DATA_DIR variables
- Fix CKPT paths: use MODEL_DIR with trailing slash, rename mcore dir to 8xnpu
- Increase NUM_ROLLOUT 1000->3000, micro-batch-size 2->1
- Change rollout-num-gpus-per-engine 2->4, mem-fraction-static 0.8->0.7
- Expand sglang-cuda-graph-bs with seq 256 32 512
- Increase sglang-chunked-prefill-size 4096->8192
- Fix ray address to 127.0.0.1:8265, update log filename to npu8-async
- Restore file mode to 755
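
The async script is configured with `--fully-async` and `--max-staleness 2`. The exact semantics of that flag in Relax are not stated in this PR. The sketch below assumes one common reading: a rollout sample is usable only if the policy version that generated it is at most `max_staleness` updates behind the trainer. `RolloutSample` and `filter_fresh` are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RolloutSample:
    policy_version: int  # trainer update whose weights generated this sample (assumed field)
    sample_id: str


def filter_fresh(
    samples: List[RolloutSample], trainer_version: int, max_staleness: int = 2
) -> List[RolloutSample]:
    """Keep samples whose generating policy lags the trainer by at most max_staleness updates."""
    return [
        s for s in samples
        if 0 <= trainer_version - s.policy_version <= max_staleness
    ]


# Samples generated by policy versions 2 through 5.
batch = [RolloutSample(v, f"s{v}") for v in range(2, 6)]
fresh = filter_fresh(batch, trainer_version=5, max_staleness=2)
```

Under this assumption, a larger staleness bound lets rollout engines run further ahead of the trainer at the cost of more off-policy data.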

feat(scripts): add NPU training scripts and rename colocate scripts

Created-by: hZhang111
Commit-by: hZhang111
Merged-by: lixionglong

Description:

## Summary

- **Added** `run-qwen35-9B-8xgpu-async-npu.sh`: 8-card NPU fully async training script (Qwen3.5-9B)
- **Renamed** the colocate scripts with a `-npu` suffix to distinguish them clearly:
  - `run-qwen3-4B-4xnpu-colocate.sh` → `run-qwen3-4B-4xnpu-colocate-npu.sh`
  - `run-qwen35-9B-4xnpu-colocate.sh` → `run-qwen35-9B-4xnpu-colocate-npu.sh`

## Changes

### ⭐ Feature

- Add an 8-card NPU fully async training script configured with `--fully-async`, `--max-staleness 2`, `--rollout-num-gpus-per-engine 2`, and `--micro-batch-size 2`
- Enable `--use-tensorboard`, `--use-clearml`, `--log-passrate`, and `--skip-eval-before-train`

### ♻️ Refactor

- Rename the colocate scripts with a `-npu` suffix

### ⚡ Performance

- Qwen3-4B: `--global-batch-size` 128→256
- Qwen3.5-9B colocate: `--rollout-temperature` 0.8→1
- Both colocate scripts: add `--use-tensorboard`

## Testing

- pre-commit passes in full (13 hooks passed)
- All 3 shell scripts pass the `bash -n` syntax check

See merge request: hw-pbclouds/Relax!35

wuli_ugliest and others added 6 commits June 23, 2026 19:34

update: update file npu-training.md

3c4218b


hbamboo requested review from NINGBENZHE and Yangruipis as code owners June 23, 2026 12:18

Lw135 and others added 5 commits June 23, 2026 20:19

update: update file npu-training.md

f3f725e


style(config): add missing trailing newline to env.yaml

517113e


style(docs): fix table column alignment in npu-training.md

a45c1aa


hbamboo force-pushed the ascend-dev-0623 branch from d5415b1 to 2f989b4 Compare June 30, 2026 06:40

hbamboo requested review from Aurelius84 and yxyOo as code owners June 30, 2026 11:37

hbamboo force-pushed the ascend-dev-0623 branch from 980e943 to 2f989b4 Compare June 30, 2026 12:29

hbamboo closed this Jun 30, 2026

hbamboo wants to merge 11 commits into redai-infra:main from hbamboo:ascend-dev-0623


2 participants
