Support Qwen35 SFT #1498

Open
hhaAndroid wants to merge 10 commits into InternLM:main from hhaAndroid:support_qwen35

Conversation

@hhaAndroid (Collaborator) commented Feb 28, 2026

Support CPT and SFT training for the Qwen3.5 MoE model

  • Support Pack-mode SFT training for the 35BA3B MoE model
  • Fix a bug from the earlier refactor where qwen3vl position ids were not padded
  • Support model saving
  • Support EP (expert parallel) training
  • Support torch.compile training
  • Support FP8 training
  • Add comprehensive unit tests
  • Support sequence parallel training (next PR)
  • Support joint MTP training (next PR)
  • Support RL training for the 35BA3B MoE model (next PR)

Notes

  1. Qwen3.5 requires transformers >= 5.2.0. This is a fairly recent version with large backward-compatibility breaks relative to v4.57.0 and similar releases. For now, Qwen3.5 is the only XTuner model that supports transformers >= 5.2.0; other models such as qwen3vl cannot run on this transformers version, and a dedicated follow-up PR will fix that.
  2. To run Qwen3.5, besides the existing dependencies, you also need to install the linear-attention dependencies causal_conv1d and flash-linear-attention.
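Since the version floor and the optional linear-attention packages are easy to get wrong, an environment check at startup can fail fast with an actionable message. The sketch below is illustrative only (the helper names are hypothetical, and it assumes flash-linear-attention installs under the import name `fla`):

```python
from importlib import metadata, util


def parse_version(v: str) -> tuple:
    """Parse a dotted version string into a tuple of ints ("5.2.0" -> (5, 2, 0))."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_qwen35_requirements(min_transformers: str = "5.2.0") -> list:
    """Return a list of human-readable problems; an empty list means the env looks OK."""
    problems = []
    try:
        installed = metadata.version("transformers")
        if parse_version(installed) < parse_version(min_transformers):
            problems.append(f"transformers {installed} < required {min_transformers}")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    # Linear-attention extras needed by GatedDeltaNet.
    for mod in ("causal_conv1d", "fla"):  # "fla" is assumed to be flash-linear-attention's import name
        if util.find_spec(mod) is None:
            problems.append(f"missing optional dependency: {mod}")
    return problems
```

Running the check once at config-validation time (rather than at import time) keeps unrelated workflows usable in environments that lack the extras.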

XTuner changes

  • Add the Qwen3_5_VLMoE35BA3Config config
  • Add the Qwen3_5_VLTextMoE class to handle the exported-weight BC break between transformers v4.57.0 and v5.0
  • Add GatedDeltaNet with varlen support as the linear attention module (introduces the causal_conv1d and flash-linear-attention third-party libraries)
  • Modify the MHA module to support gated MHA
  • Add zero_centered RMSNorm (this change touches many places)
  • Let DecoderLayer choose, based on config, whether the current layer uses full attention or linear attention
  • Add apply_rotary_pos_emb_cuda_for_partial_rotary to support partial_rotary
  • Fix the refactor-introduced bug where qwen3vl position ids were not padded
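The zero_centered RMSNorm added in this PR can be sketched as below. This is an illustrative guess at the variant, not XTuner's actual code (the class name and details are hypothetical): the learnable weight is stored zero-centered and applied as (1 + weight), so a freshly initialized norm scales by exactly 1.

```python
import torch
from torch import nn


class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose weight is stored zero-centered: y = rms(x) * (1 + w).

    Illustrative sketch only; XTuner's real class and details may differ.
    """

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        # Zero init: the effective scale (1 + weight) starts at exactly 1.
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()  # normalize in fp32 for numerical stability
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (x * (1.0 + self.weight.float())).to(dtype)
```

One practical upside of the zero-centered parameterization is that weight decay pulls the effective scale toward 1 rather than toward 0.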

TODO

  • GatedDeltaNet does not yet support torch.compile(fullgraph=True) mode; this can be optimized later
  • Because of that limitation, even layers that could run with fullgraph enabled currently fall back to fullgraph=False
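The fallback described in the TODO amounts to a per-layer decision that could later be relaxed. A hypothetical sketch of that decision (names and structure are assumptions, not XTuner's code):

```python
# Layer types that could in principle compile with fullgraph=True;
# GatedDeltaNet (linear attention) currently graph-breaks, so it is excluded.
FULLGRAPH_CAPABLE = {"full_attention"}


def select_fullgraph(layer_type: str, gated_deltanet_fixed: bool = False) -> bool:
    """Return the fullgraph flag to pass to torch.compile for one decoder layer.

    Until GatedDeltaNet supports fullgraph=True, the conservative choice is
    fullgraph=False for every layer; once fixed, full-attention layers can opt in.
    """
    if not gated_deltanet_fixed:
        return False  # the current blanket fallback described in the TODO
    return layer_type in FULLGRAPH_CAPABLE
```

With this shape, lifting the restriction later is a one-line change at the call site instead of a rewrite of the compile wiring.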

@hhaAndroid (Collaborator, Author) commented: @claude

claude bot commented Feb 28, 2026

Claude Code is working… I'll analyze this and get back to you.

@HAOCHENYE (Collaborator) commented: @claude

@hhaAndroid (Collaborator, Author) commented: @claude
@hhaAndroid changed the title from "Support qwen35" to "Support Qwen35 SFT" on Feb 28, 2026
@HAOCHENYE (Collaborator) commented: @claude
InternLM deleted a comment from claude bot Feb 28, 2026

claude bot commented Feb 28, 2026

Claude: Review Summary

This PR adds Qwen3.5 MoE model support for CPT/SFT training, introducing GatedDeltaNet linear attention, zero-centered RMSNorm, partial rotary embeddings, gated attention, and shared expert gating. It is a significant feature addition touching 22 files.

Issues

Critical

  • [qwen3_5_text.py:17] Typo "defalut" in Literal type — prevents valid "default" values from being accepted
  • [rms_norm/protocol.py:8] RMSNormProtocol signature incorrectly updated with a type param that no implementation accepts
  • [gate_deltanet.py:20-22] Top-level imports of fla and causal_conv1d (optional deps) will break imports for all users without these packages
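The third Critical item (top-level imports of optional dependencies) is commonly fixed by deferring the import into the code path that actually needs it. A generic sketch, not XTuner's actual code (the helper and class names are hypothetical):

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency on first use, with an actionable error.

    Keeps package-level imports working for users who never touch
    GatedDeltaNet and therefore never need causal_conv1d / fla installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            f"Install it (e.g. `pip install {module_name}`) to use this feature."
        ) from exc


class GatedDeltaNetSketch:
    def __init__(self):
        # Deferred: this only raises when the layer is actually constructed,
        # not when the module containing it is merely imported.
        self._conv1d = require_optional("causal_conv1d", "GatedDeltaNet")
```

The same pattern also gives users a clearer error message than a bare ModuleNotFoundError at import time.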

Warning

  • [moe_decoder_layer.py:130-134] Removing .float() cast from router gate F.linear may cause numerical instability in mixed-precision training
  • [moe_decoder_layer.py:394-425] ~30 lines of commented-out debug/alignment code should be removed
  • [rms_norm.py:14] type used as parameter name, shadows Python builtin — consider norm_type
  • [modeling_qwen3_vl.py:185-188] Assertion commented out instead of being made conditional — weakens validation for all models
  • [sequence_context.py:104-108] seq_idx computed eagerly for all models, but only needed by GatedDeltaNet — consider lazy computation
  • [moe_decoder_layer.py] F.sigmoid is deprecated — use torch.sigmoid() or .sigmoid()
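For the seq_idx warning, one common pattern is to compute the value lazily and cache it, so only models that actually read it (GatedDeltaNet here) pay the cost. A minimal sketch with a build counter to demonstrate the laziness; the class and field names are hypothetical, and plain lists stand in for tensors:

```python
from functools import cached_property


class SequenceContextSketch:
    """Holds per-sequence metadata; seq_idx is built only on first access."""

    def __init__(self, cu_seq_lens):
        self.cu_seq_lens = cu_seq_lens  # e.g. [0, 3, 5] for packed lengths 3 and 2
        self.seq_idx_builds = 0  # instrumentation for this sketch only

    @cached_property
    def seq_idx(self):
        # Expand cumulative lengths into a per-token sequence index,
        # e.g. [0, 3, 5] -> [0, 0, 0, 1, 1]; cached after the first call.
        self.seq_idx_builds += 1
        out = []
        for i in range(len(self.cu_seq_lens) - 1):
            out.extend([i] * (self.cu_seq_lens[i + 1] - self.cu_seq_lens[i]))
        return out
```

Models that never touch the attribute never trigger the computation, while GatedDeltaNet layers can read it repeatedly at the cost of a single build.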

Nit

  • [base.py:150] rms_norm_type is str instead of Literal["default", "zero_centered"] — inconsistent with other usages
  • [rms_norm/__init__.py] Missing newline at end of file
  • Single quotes used in several new code paths where codebase convention is double quotes
  • Missing unit tests (noted as TODO in PR description)

Verdict

REQUEST_CHANGES

@hhaAndroid hhaAndroid requested a review from HAOCHENYE March 2, 2026 09:14