feat(npu): add NPU training support for Qwen3-4B and Qwen3.5-9B in full async and colocate modes #52

Closed
hbamboo wants to merge 11 commits into redai-infra:main from hbamboo:ascend-dev-0623

Conversation

hbamboo (Contributor) commented Jun 23, 2026

What

This PR introduces full Ascend NPU support for the Relax training pipeline, covering the Docker image build, device configuration, Megatron runtime
adaptation, upstream dependency patches, and colocated training launch scripts.

Why

To enable Relax RL post-training on Ascend NPU hardware (e.g., Huawei 910B). This allows users to run GRPO, DAPO, and other algorithms on NPU clusters
with the same workflow as GPU, lowering the barrier for domestic hardware adoption.

How

  1. NPU Patch Stack (feat(npu): add Megatron patch stack)
    - Patch Megatron-LM for NPU-compatible checkpointing, Transformer Engine, MoE, MTP, optimizer offload, and distributed runtime behavior (940+ lines).
    - Patch MindSpeed-Bridge for Qwen3-VL, Gated Delta Net, and model conversion compatibility.
    - Patch torch-memory-saver for colocated NPU training with TMS preload mode.
  2. Device Preservation (feat(npu): preserve Ascend devices)
    - Add RAY_EXPERIMENTAL_NOSET_ASCEND_VISIBLE_DEVICES=1 flag to prevent Ray from overwriting Ascend visible devices.
    - Remove local NPU allocator override from the Ray runtime environment (scripts/entrypoint/local-npu.sh).
  3. Megatron Runtime Adaptation (fix(npu): adapt Megatron runtime)
    - Reload and destroy process groups directly during offloaded save flows instead of fully waking/sleeping the training model.
    - Preserve memory state after weight updates by skipping cache clearing during memory logging.
    - Propagate gradient_accumulation_fusion into the Megatron model provider.
    - Use NPU host memory cache cleanup API (torch_npu.npu.reset_max_memory_reserved) when on NPU devices.
  4. Colocated Training Scripts (feat(npu): add 4x colocate scripts)
    - Add run-qwen3-4B-4xnpu-colocate.sh for Qwen3-4B on 4x NPU.
    - Add run-qwen35-9B-4xnpu-colocate.sh for Qwen3.5-9B on 4x NPU.
    - Configure Ascend, HCCL, and TMS environment variables for local NPU training.
  5. Docker Image Update (fix(npu): update Ascend Dockerfile)
    - Switch to Ascend CANN 8.5.1 Python 3.11 base image.
    - Build PyTorch 2.9.0 NPU stack (torch, torch_npu, triton-ascend).
    - Install updated Megatron-LM, MindSpeed, sglang, and sgl-kernel-npu.
    - Apply NPU patch stack during image build and clean caches to reduce image size.
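
The device-preservation step (item 2) can be pictured as a Ray `runtime_env` that carries the flag to every worker process. This is a minimal sketch, not the code in this PR: `build_npu_runtime_env` is a hypothetical helper, the `HCCL_BUFFSIZE` value is purely illustrative, and the flag name is copied verbatim from this description, so it should be checked against the Ray release in use.

```python
from typing import Dict, Optional

# Flag name copied verbatim from the PR description (assumption: verify the
# exact spelling against the Ray release you run; this sketch does not).
NOSET_FLAG = "RAY_EXPERIMENTAL_NOSET_ASCEND_VISIBLE_DEVICES"


def build_npu_runtime_env(extra_env: Optional[Dict[str, str]] = None) -> dict:
    """Hypothetical helper: build a Ray runtime_env dict for NPU jobs."""
    env_vars = {NOSET_FLAG: "1"}
    # Caller-supplied variables are merged in, but they cannot override the flag.
    for key, value in (extra_env or {}).items():
        if key != NOSET_FLAG:
            env_vars[key] = value
    return {"env_vars": env_vars}


if __name__ == "__main__":
    # The value "512" is illustrative only, not taken from the scripts.
    print(build_npu_runtime_env({"HCCL_BUFFSIZE": "512"}))
```

With Ray installed, the returned dict could be passed as `ray.init(runtime_env=build_npu_runtime_env())`; `env_vars` is the standard runtime-environment key for per-job environment variables.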

Testing

  • pre-commit run --all-files passes
  • Tests pass (pytest tests/)
  • New tests added (if applicable)
  • Documentation updated (if applicable)

Type of Change

  • New feature (non-breaking change that adds functionality)
  • Bug fix (non-breaking change that fixes an issue)
  • CI/CD or build changes

wuli_ugliest and others added 6 commits June 23, 2026 19:34
# ⭐ Feature

## Add NPU patch stack for training dependencies

- Patch Megatron-LM for NPU-compatible checkpointing, Transformer Engine, MoE, MTP, optimizer offload, and distributed runtime behavior
- Patch MindSpeed-Bridge for Qwen3-VL, Gated Delta Net, and model conversion compatibility
- Patch torch-memory-saver to support colocated NPU training with TMS preload mode

Co-Authored-By: Claude <noreply@anthropic.com>
# ⭐ Feature

## Preserve Ascend visibility in Ray jobs

- Add the Ray experimental flag that prevents overriding Ascend visible devices
- Remove the local NPU allocator override from the Ray runtime environment

Co-Authored-By: Claude <noreply@anthropic.com>
# 🐛 Bug Fix

## Adapt Megatron runtime for NPU training

- Reload and destroy process groups directly during offloaded save flows instead of fully waking or sleeping the training model
- Preserve memory state after weight updates by avoiding cache clearing during memory logging
- Propagate gradient_accumulation_fusion into the Megatron model provider when available
- Use the NPU host memory cache cleanup API when running on NPU devices

Co-Authored-By: Claude <noreply@anthropic.com>
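
The device-specific cleanup described in the commit above might follow a dispatch pattern like this one. It is a sketch under assumptions: `pick_cache_cleanup_backend` and `clear_device_cache` are hypothetical names, and `torch.npu.empty_cache()` is used as a stand-in, since the commit message does not spell out the exact cleanup call.

```python
import importlib.util


def pick_cache_cleanup_backend() -> str:
    """Return 'npu', 'cuda', or 'none' depending on what is usable here."""
    if importlib.util.find_spec("torch") is None:
        return "none"
    import torch

    if importlib.util.find_spec("torch_npu") is not None:
        import torch_npu  # noqa: F401  (importing registers torch.npu)

        if torch.npu.is_available():
            return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "none"


def clear_device_cache() -> str:
    """Release cached allocator memory on the active backend, if any."""
    backend = pick_cache_cleanup_backend()
    if backend == "npu":
        import torch

        torch.npu.empty_cache()  # stand-in for the PR's NPU cleanup call
    elif backend == "cuda":
        import torch

        torch.cuda.empty_cache()
    return backend


if __name__ == "__main__":
    print("cleared cache on backend:", clear_device_cache())
```

Detecting the backend once and branching keeps the GPU path untouched, which matches the PR's stated goal of sharing one workflow across GPU and NPU.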
# ⭐ Feature

## Add 4x NPU colocated training scripts

- Add a Qwen3-4B colocated training launch script for 4x NPU setups
- Add a Qwen3.5-9B colocated training launch script for 4x NPU setups
- Configure Ascend, HCCL, and TMS environment variables for local NPU training
- Submit colocated training jobs through Ray with externally configurable model and data paths

Co-Authored-By: Claude <noreply@anthropic.com>
# 🐛 Bug Fix

## Update the Ascend NPU training image

- Switch the NPU image to the Ascend CANN 8.5.1 Python 3.11 base image
- Build and install the PyTorch 2.9.0 NPU stack, including torch_npu and triton-ascend
- Install updated Megatron-LM, MindSpeed, MindSpeed-Bridge, Megatron-Bridge, sglang, and sgl-kernel-npu versions
- Apply the NPU patch stack during image build and copy bridge integrations into Megatron-LM
- Build sgl-kernel-npu kernels and memory-saver components for colocated NPU training
- Clean package and temporary caches to reduce image size

Co-Authored-By: Claude <noreply@anthropic.com>
Update the Markdown docs for the newly merged PR

Signed-off-by: dabuliu123 <270334047@qq.com>
Lw135 and others added 5 commits June 23, 2026 20:19
Signed-off-by: dabuliu123 <270334047@qq.com>
# 🎨 Style

## Fix missing EOF newline in env.yaml

- Add trailing newline at end of configs/env.yaml to comply with POSIX standards
# 🎨 Style

## Fix mdformat table alignment in NPU training doc

- Apply the mdformat auto-fix so table column widths match content length
feat(scripts): add NPU training scripts and rename colocate scripts

# ⭐ Feature

## Add 8x NPU async training script for Qwen3.5-9B

- Add run-qwen35-9B-8xgpu-async-npu.sh for 8-card NPU fully async training
- Configure --fully-async with --max-staleness 2 and --num-iters-per-train-update 32
- Set --rollout-num-gpus-per-engine 2, --micro-batch-size 2 to avoid OOM
- Add --log-passrate and --skip-eval-before-train in EVAL_ARGS
- Enable --use-tensorboard and --use-clearml in WANDB_ARGS
- Set extra env: SGLANG_SET_CPU_AFFINITY, STREAMS_PER_DEVICE, HCCL_BUFFSIZE, HCCL_OP_EXPANSION_MODE
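
To make the `--max-staleness 2` setting concrete, here is one common way a staleness bound is applied in asynchronous RL training. This is only an illustrative sketch: it assumes staleness means the gap between the trainer's current policy version and the version that generated a rollout, and that over-stale samples are dropped. Relax may define or handle staleness differently, and `RolloutSample` and `drop_stale` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RolloutSample:
    sample_id: int
    policy_version: int  # hypothetical: weights version that produced the sample


def drop_stale(
    samples: List[RolloutSample], trainer_version: int, max_staleness: int = 2
) -> List[RolloutSample]:
    """Keep samples whose version lag is at most max_staleness (assumed rule)."""
    return [
        s for s in samples if trainer_version - s.policy_version <= max_staleness
    ]


if __name__ == "__main__":
    batch = [RolloutSample(i, v) for i, v in enumerate([5, 4, 3, 2])]
    kept = drop_stale(batch, trainer_version=5, max_staleness=2)
    print([s.sample_id for s in kept])
```

Under this reading, a larger bound lets rollout engines run further ahead of the trainer at the cost of training on more off-policy data.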

---

# ♻️ Refactor

## Rename colocate scripts with -npu suffix for clarity

- Rename run-qwen3-4B-4xnpu-colocate.sh -> run-qwen3-4B-4xnpu-colocate-npu.sh
- Rename run-qwen35-9B-4xnpu-colocate.sh -> run-qwen35-9B-4xnpu-colocate-npu.sh

---

# ⚡ Performance

## Tune training hyperparameters

- Qwen3-4B colocate: increase --global-batch-size from 128 to 256
- Qwen3.5-9B colocate: increase --rollout-temperature from 0.8 to 1
- Both colocate scripts: add --use-tensorboard in WANDB_ARGS

refactor(scripts): fix NPU script naming convention

# ♻️ Refactor

## Fix NPU training script naming

- Rename run-qwen3-4B-4xnpu-colocate-npu.sh -> run-qwen3-4B-4xnpu-colocate.sh (drop redundant -npu suffix)
- Rename run-qwen35-9B-4xnpu-colocate-npu.sh -> run-qwen35-9B-4xnpu-colocate.sh (drop redundant -npu suffix)
- Rename run-qwen35-9B-8xgpu-async-npu.sh -> run-qwen35-9B-8xnpu-async.sh (8xgpu -> 8xnpu)
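
The naming convention these renames settle on appears to be `run-<model>-<count>x<device>-<mode>.sh`. The pattern below is inferred from the three file names in this commit and is not taken from the repository; `parse_script_name` is a hypothetical helper that could back a lint check.

```python
import re
from typing import Optional

# Pattern inferred from the renamed scripts (an assumption, not a repo rule).
SCRIPT_NAME = re.compile(
    r"^run-(?P<model>.+)-(?P<count>\d+)x(?P<device>npu|gpu)"
    r"-(?P<mode>colocate|async)\.sh$"
)


def parse_script_name(name: str) -> Optional[dict]:
    """Split a launch-script name into its parts, or return None if it does not fit."""
    match = SCRIPT_NAME.match(name)
    if match is None:
        return None
    info = match.groupdict()
    info["count"] = int(info["count"])
    return info


if __name__ == "__main__":
    for name in (
        "run-qwen3-4B-4xnpu-colocate.sh",
        "run-qwen35-9B-8xnpu-async.sh",
        "run-qwen35-9B-8xgpu-async-npu.sh",  # pre-rename name, rejected
    ):
        print(name, "->", parse_script_name(name))
```

The pre-rename names fail the check because of the trailing `-npu` suffix, which is the redundancy this commit removes.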

refactor(scripts): update 8xnpu-async script for 8-card NPU setup

# ♻️ Refactor

## Standardize run-qwen35-9B-8xnpu-async.sh for 8-card NPU

- Update header comment to Qwen3.5-9B 8xGPU and fix usage line
- Remove 16-card env vars (SGLANG_SET_CPU_AFFINITY, STREAMS_PER_DEVICE, HCCL_BUFFSIZE, etc.)
- Set ASCEND_RT_VISIBLE_DEVICES to 8 cards (0-7) and add ASCEND_COREDUMP_SIGNAL
- Standardize port ranges and EXP_DIR/MODEL_DIR/DATA_DIR variables
- Fix CKPT paths: use MODEL_DIR with trailing slash, rename mcore dir to 8xnpu
- Increase NUM_ROLLOUT 1000->3000, micro-batch-size 2->1
- Change rollout-num-gpus-per-engine 2->4, mem-fraction-static 0.8->0.7
- Expand sglang-cuda-graph-bs with seq 256 32 512
- Increase sglang-chunked-prefill-size 4096->8192
- Fix ray address to 127.0.0.1:8265, update log filename to npu8-async
- Restore file mode to 755
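
Two of the values above can be spelled out. The sketch assumes `seq 256 32 512` uses the coreutils form `seq FIRST INCREMENT LAST` and that the eight cards are listed as a comma-separated index string; the helper names are hypothetical and the script itself may write these values differently.

```python
from typing import List


def visible_devices(num_cards: int) -> str:
    """Comma-separated device indices, e.g. for ASCEND_RT_VISIBLE_DEVICES."""
    return ",".join(str(i) for i in range(num_cards))


def seq(first: int, increment: int, last: int) -> List[int]:
    """Python equivalent of `seq FIRST INCREMENT LAST` for a positive increment."""
    if increment <= 0:
        raise ValueError("this sketch only handles a positive increment")
    return list(range(first, last + 1, increment))


if __name__ == "__main__":
    print(visible_devices(8))
    print(seq(256, 32, 512))
```

Under these assumptions the device string covers indices 0 through 7, and the `seq` call contributes nine batch sizes from 256 to 512 in steps of 32.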
feat(scripts): add NPU training scripts and rename colocate scripts

Created-by: hZhang111
Commit-by: hZhang111
Merged-by: lixionglong
Description: ## Summary

- **Add** `run-qwen35-9B-8xgpu-async-npu.sh`: 8-card NPU fully async training script (Qwen3.5-9B)
- **Rename** the colocate scripts with a `-npu` suffix to distinguish them clearly:
  - `run-qwen3-4B-4xnpu-colocate.sh` → `run-qwen3-4B-4xnpu-colocate-npu.sh`
  - `run-qwen35-9B-4xnpu-colocate.sh` → `run-qwen35-9B-4xnpu-colocate-npu.sh`

## Changes

### ⭐ Feature
- Add an 8-card NPU fully async training script configured with `--fully-async`, `--max-staleness 2`, `--rollout-num-gpus-per-engine 2`, and `--micro-batch-size 2`
- Enable `--use-tensorboard`, `--use-clearml`, `--log-passrate`, and `--skip-eval-before-train`

### ♻️ Refactor
- Rename the colocate scripts with a `-npu` suffix

### ⚡ Performance
- Qwen3-4B: `--global-batch-size` 128→256
- Qwen3.5-9B colocate: `--rollout-temperature` 0.8→1
- Both colocate scripts: add `--use-tensorboard`

## Testing
- pre-commit passes in full (13 hooks passed)
- All 3 shell scripts pass the `bash -n` syntax check

See merge request: hw-pbclouds/Relax!35
2 participants