Support Qwen35 SFT #1498

Open
hhaAndroid wants to merge 10 commits into InternLM:main from hhaAndroid:support_qwen35

Conversation

@hhaAndroid (Collaborator) commented Feb 28, 2026

Support CPT and SFT training for the Qwen3.5 MoE model

  • Support Pack-mode SFT training for the 35BA3B MoE model
  • Fix a bug from the earlier refactor where qwen3vl position ids were not padded
  • Support model saving
  • Support EP (expert parallel) training
  • Support torch.compile training
  • Support FP8 training
  • Add comprehensive unit tests
  • Support sequence parallel training (next PR)
  • Support joint MTP training (next PR)
  • Support RL training for the 35BA3B MoE model (next PR)

Notes

  1. Qwen3.5 requires transformers >= 5.2.0. This is a fairly recent version with large backward-compatibility breaks relative to v4.57.0 and similar releases. For now, Qwen3.5 is the only XTuner model that supports transformers >= 5.2.0; other models such as qwen3vl cannot run on this transformers version, and a dedicated follow-up PR will fix that.
  2. To run Qwen3.5, besides the existing dependencies, you also need to install the linear-attention dependencies causal_conv1d and flash-linear-attention.
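Since the version floor and the optional linear-attention packages are easy to get wrong, an environment check at startup can fail fast with an actionable message. The sketch below is illustrative only (the helper names are hypothetical, and it assumes flash-linear-attention installs under the import name `fla`):

```python
from importlib import metadata, util


def parse_version(v: str) -> tuple:
    """Parse a dotted version string into a tuple of ints ("5.2.0" -> (5, 2, 0))."""
    parts = []
    for piece in v.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_qwen35_requirements(min_transformers: str = "5.2.0") -> list:
    """Return a list of human-readable problems; an empty list means the env looks OK."""
    problems = []
    try:
        installed = metadata.version("transformers")
        if parse_version(installed) < parse_version(min_transformers):
            problems.append(f"transformers {installed} < required {min_transformers}")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    # Linear-attention extras needed by GatedDeltaNet.
    for mod in ("causal_conv1d", "fla"):  # "fla" is assumed to be flash-linear-attention's import name
        if util.find_spec(mod) is None:
            problems.append(f"missing optional dependency: {mod}")
    return problems
```

Running the check once at config-validation time (rather than at import time) keeps unrelated workflows usable in environments that lack the extras.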

XTuner changes

  • Add the Qwen3_5_VLMoE35BA3Config config
  • Add the Qwen3_5_VLTextMoE class to handle the exported-weight BC break between transformers v4.57.0 and v5.0
  • Add GatedDeltaNet with varlen support as the linear attention module (introduces the causal_conv1d and flash-linear-attention third-party libraries)
  • Modify the MHA module to support gated MHA
  • Add zero_centered RMSNorm (this change touches many places)
  • Let DecoderLayer choose, based on config, whether the current layer uses full attention or linear attention
  • Add apply_rotary_pos_emb_cuda_for_partial_rotary to support partial_rotary
  • Fix the refactor-introduced bug where qwen3vl position ids were not padded
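The zero_centered RMSNorm added in this PR can be sketched as below. This is an illustrative guess at the variant, not XTuner's actual code (the class name and details are hypothetical): the learnable weight is stored zero-centered and applied as (1 + weight), so a freshly initialized norm scales by exactly 1.

```python
import torch
from torch import nn


class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose weight is stored zero-centered: y = rms(x) * (1 + w).

    Illustrative sketch only; XTuner's real class and details may differ.
    """

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        # Zero init: the effective scale (1 + weight) starts at exactly 1.
        self.weight = nn.Parameter(torch.zeros(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        dtype = x.dtype
        x = x.float()  # normalize in fp32 for numerical stability
        variance = x.pow(2).mean(-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return (x * (1.0 + self.weight.float())).to(dtype)
```

One practical upside of the zero-centered parameterization is that weight decay pulls the effective scale toward 1 rather than toward 0.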

TODO

  • GatedDeltaNet does not yet support torch.compile(fullgraph=True) mode; this can be optimized later
  • Because of that limitation, even layers that could run with fullgraph enabled currently fall back to fullgraph=False
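The fallback described in the TODO amounts to a per-layer decision that could later be relaxed. A hypothetical sketch of that decision (names and structure are assumptions, not XTuner's code):

```python
# Layer types that could in principle compile with fullgraph=True;
# GatedDeltaNet (linear attention) currently graph-breaks, so it is excluded.
FULLGRAPH_CAPABLE = {"full_attention"}


def select_fullgraph(layer_type: str, gated_deltanet_fixed: bool = False) -> bool:
    """Return the fullgraph flag to pass to torch.compile for one decoder layer.

    Until GatedDeltaNet supports fullgraph=True, the conservative choice is
    fullgraph=False for every layer; once fixed, full-attention layers can opt in.
    """
    if not gated_deltanet_fixed:
        return False  # the current blanket fallback described in the TODO
    return layer_type in FULLGRAPH_CAPABLE
```

With this shape, lifting the restriction later is a one-line change at the call site instead of a rewrite of the compile wiring.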

@hhaAndroid (Collaborator, Author) commented: @claude

claude bot commented Feb 28, 2026

Claude Code is working… I'll analyze this and get back to you.

@HAOCHENYE (Collaborator) commented: @claude

@hhaAndroid (Collaborator, Author) commented: @claude
@hhaAndroid changed the title from "Support qwen35" to "Support Qwen35 SFT" on Feb 28, 2026
@HAOCHENYE (Collaborator) commented: @claude
InternLM deleted a comment from claude bot Feb 28, 2026

claude bot commented Feb 28, 2026

Claude: Review Summary

This PR adds Qwen3.5 MoE model support for CPT/SFT training, introducing GatedDeltaNet linear attention, zero-centered RMSNorm, partial rotary embeddings, gated attention, and shared expert gating. It is a significant feature addition touching 22 files.

Issues

Critical

  • [qwen3_5_text.py:17] Typo "defalut" in Literal type — prevents valid "default" values from being accepted
  • [rms_norm/protocol.py:8] RMSNormProtocol signature incorrectly updated with a type param that no implementation accepts
  • [gate_deltanet.py:20-22] Top-level imports of fla and causal_conv1d (optional deps) will break imports for all users without these packages
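The third Critical item (top-level imports of optional dependencies) is commonly fixed by deferring the import into the code path that actually needs it. A generic sketch, not XTuner's actual code (the helper and class names are hypothetical):

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency on first use, with an actionable error.

    Keeps package-level imports working for users who never touch
    GatedDeltaNet and therefore never need causal_conv1d / fla installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} requires the optional dependency '{module_name}'. "
            f"Install it (e.g. `pip install {module_name}`) to use this feature."
        ) from exc


class GatedDeltaNetSketch:
    def __init__(self):
        # Deferred: this only raises when the layer is actually constructed,
        # not when the module containing it is merely imported.
        self._conv1d = require_optional("causal_conv1d", "GatedDeltaNet")
```

The same pattern also gives users a clearer error message than a bare ModuleNotFoundError at import time.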

Warning

  • [moe_decoder_layer.py:130-134] Removing .float() cast from router gate F.linear may cause numerical instability in mixed-precision training
  • [moe_decoder_layer.py:394-425] ~30 lines of commented-out debug/alignment code should be removed
  • [rms_norm.py:14] type used as parameter name, shadows Python builtin — consider norm_type
  • [modeling_qwen3_vl.py:185-188] Assertion commented out instead of being made conditional — weakens validation for all models
  • [sequence_context.py:104-108] seq_idx computed eagerly for all models, but only needed by GatedDeltaNet — consider lazy computation
  • [moe_decoder_layer.py] F.sigmoid is deprecated — use torch.sigmoid() or .sigmoid()
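For the seq_idx warning, one common pattern is to compute the value lazily and cache it, so only models that actually read it (GatedDeltaNet here) pay the cost. A minimal sketch with a build counter to demonstrate the laziness; the class and field names are hypothetical, and plain lists stand in for tensors:

```python
from functools import cached_property


class SequenceContextSketch:
    """Holds per-sequence metadata; seq_idx is built only on first access."""

    def __init__(self, cu_seq_lens):
        self.cu_seq_lens = cu_seq_lens  # e.g. [0, 3, 5] for packed lengths 3 and 2
        self.seq_idx_builds = 0  # instrumentation for this sketch only

    @cached_property
    def seq_idx(self):
        # Expand cumulative lengths into a per-token sequence index,
        # e.g. [0, 3, 5] -> [0, 0, 0, 1, 1]; cached after the first call.
        self.seq_idx_builds += 1
        out = []
        for i in range(len(self.cu_seq_lens) - 1):
            out.extend([i] * (self.cu_seq_lens[i + 1] - self.cu_seq_lens[i]))
        return out
```

Models that never touch the attribute never trigger the computation, while GatedDeltaNet layers can read it repeatedly at the cost of a single build.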

Nit

  • [base.py:150] rms_norm_type is str instead of Literal["default", "zero_centered"] — inconsistent with other usages
  • [rms_norm/__init__.py] Missing newline at end of file
  • Single quotes used in several new code paths where codebase convention is double quotes
  • Missing unit tests (noted as TODO in PR description)

Verdict

REQUEST_CHANGES

@hhaAndroid hhaAndroid requested a review from HAOCHENYE March 2, 2026 09:14