Support MTP weight reuse with unrolled steps#29
MDR-EX1000 wants to merge 3 commits into modelscope:release/1.1
Conversation
- add mtp_unroll_steps config plumbing for reused MTP execution
- unroll MultiTokenPredictionBlock with shared physical layers and explicit depth indices
- compute MTP loss using the unrolled depth and keep multimodal decoder input detachable
- add focused tests covering shared-layer execution and unrolled loss handling
Code Review
This pull request introduces the ability to unroll Multi-Token Prediction (MTP) layers beyond the number of physical layers using a new mtp_unroll_steps configuration. Key changes include updating GPTModel._postprocess to support variable MTP depths, implementing a block_forward method to handle layer reuse, and adding configuration options for detaching decoder inputs. Review feedback identifies critical issues in the block_forward implementation regarding Pipeline Parallelism compatibility, specifically a potential division by zero and incorrect depth calculations. Additionally, a potential crash was noted where rotary_pos_emb is rolled without checking whether it is None.
```python
offset = get_mtp_layer_offset(self.config, self.vp_stage)
hidden_states_list = list(torch.chunk(hidden_states, 1 + offset, dim=0))
hidden_states = hidden_states_list[offset]

physical_num_layers = len(self.layers)
unroll_steps = getattr(self.config, 'mtp_unroll_steps', None) or self.config.mtp_num_layers

for step in range(unroll_steps):
    layer = self.layers[step % physical_num_layers]
    global_depth = offset + step + 1
    hidden_states, input_ids, position_ids = layer(
        input_ids=input_ids,
        position_ids=position_ids,
        hidden_states=hidden_states,
        attention_mask=attention_mask,
        inference_params=inference_params,
        rotary_pos_emb=rotary_pos_emb,
        rotary_pos_cos=rotary_pos_cos,
        rotary_pos_sin=rotary_pos_sin,
        packed_seq_params=packed_seq_params,
        sequence_len_offset=sequence_len_offset,
        embedding=embedding,
        depth_idx=global_depth,
        **(extra_block_kwargs or {}),
    )
    hidden_states_list.append(hidden_states)

hidden_states = torch.cat(hidden_states_list, dim=0)
return hidden_states
```
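For intuition, the reuse indexing in the loop above can be sketched in isolation (stand-in names; the real `self.layers` entries are modules, not strings): with one physical layer and three unroll steps, every step resolves to the same shared weights while the depth index keeps advancing.

```python
# Minimal sketch of the layer-reuse schedule (hypothetical names).
physical_layers = ["mtp_layer_0"]  # len(self.layers) == 1 on this stage
unroll_steps = 3                   # mtp_unroll_steps
offset = 0                         # get_mtp_layer_offset(...) result

schedule = []
for step in range(unroll_steps):
    layer = physical_layers[step % len(physical_layers)]  # shared weights
    global_depth = offset + step + 1                      # explicit depth index
    schedule.append((layer, global_depth))

print(schedule)
# [('mtp_layer_0', 1), ('mtp_layer_0', 2), ('mtp_layer_0', 3)]
```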
The block_forward implementation is not fully compatible with Pipeline Parallelism (PP) when mtp_unroll_steps is used:

- ZeroDivisionError: If a PP stage contains no MTP layers (`len(self.layers) == 0`), line 504 will crash due to `step % physical_num_layers`. A guard is needed to return early or skip the loop in such stages.
- Incorrect chunking and offset: `torch.chunk(hidden_states, 1 + offset, dim=0)` and `global_depth = offset + step + 1` rely on the physical layer offset. If previous stages have already performed logical unrolling, the number of chunks in `hidden_states` and the starting logical depth will be higher than what the physical offset indicates.
- Redundant execution: Every stage with MTP layers will attempt to execute the full `unroll_steps`. In a PP setup where MTP layers are distributed, this leads to an incorrect total number of logical steps and mismatched chunk counts in `_postprocess`.

If this feature is primarily intended for the `mtp_num_layers=1` case (single shared layer), please add a check for `physical_num_layers > 0` and consider documenting the PP limitations.
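A minimal sketch of the suggested guard (heavily simplified signature, with layers modeled as plain callables): stages that own no physical MTP layers pass the hidden states through unchanged instead of hitting `step % 0`.

```python
# Hedged sketch, not the actual Megatron implementation.
def block_forward_guarded(layers, unroll_steps, hidden_states):
    physical_num_layers = len(layers)
    if physical_num_layers == 0:
        # This PP stage holds no MTP layers; skip the unroll loop entirely.
        return hidden_states
    for step in range(unroll_steps):
        layer = layers[step % physical_num_layers]  # reuse shared weights
        hidden_states = layer(hidden_states)
    return hidden_states
```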
```diff
 else:
     # mrope or not packed_seq
-    rotary_pos_emb = torch.roll(rotary_pos_emb, shifts=-self.layer_number, dims=0)
+    rotary_pos_emb = torch.roll(rotary_pos_emb, shifts=-effective_depth, dims=0)
```
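For reference, `torch.roll(x, shifts=-d, dims=0)` moves the first `d` rows to the end, so the embedding for position `d` lands at index 0. A torch-free sketch of the same shift semantics on a plain list:

```python
def roll(seq, shifts):
    """Plain-list analogue of torch.roll along dim 0."""
    shifts %= len(seq)
    return seq[-shifts:] + seq[:-shifts]

emb = ["pos0", "pos1", "pos2", "pos3"]
effective_depth = 2
print(roll(emb, -effective_depth))
# ['pos2', 'pos3', 'pos0', 'pos1']
```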
Good catch. The assumption that rotary_pos_emb is not None pre-exists in the base branch; this PR only changes the shift depth from self.layer_number to effective_depth for the unrolled MTP case and does not introduce a new dereference path here. If needed, I can add the None guard in a separate follow-up cleanup.
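A minimal sketch of what such a follow-up guard could look like (hypothetical helper name; not part of this PR), tolerating paths where `rotary_pos_emb` is never materialized:

```python
def shift_rotary_pos_emb(rotary_pos_emb, effective_depth):
    """Roll the rotary embedding by the effective MTP depth, tolerating None."""
    if rotary_pos_emb is None:
        # Some mrope/inference paths do not materialize rotary_pos_emb.
        return None
    import torch  # imported lazily; only needed when an embedding exists
    return torch.roll(rotary_pos_emb, shifts=-effective_depth, dims=0)

print(shift_rotary_pos_emb(None, 3))  # None
```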
Background
The Qwen3.5 series already supports mtp-k inference with a single shared MTP layer, but the previous implementation did not support shared-weight mtp-k training. This change adds support for training by repeatedly unrolling one physical MTP layer for k steps, and it has been validated to improve speculative decoding acceptance rate in mtp-k inference.
P.S. A small follow-up change is still needed in the upstream MS-SWIFT repository, and I will provide it separately.
Summary
This PR adds support for MTP weight reuse through logical unrolling.
What Changed
- Add `mtp_unroll_steps` to runtime config.
- Reuse physical MTP layers in `MultiTokenPredictionBlock.forward`.
- Add `decoder_input_detach` for multimodal MTP decoder input handling.

Validation
- Added `tests/test_mtp_reuse.py`.
- `PYTHONPATH=src:${PYTHONPATH} python -m pytest -q tests/test_mtp_reuse.py` reports 2 passed.

Notes
MTP weight reuse is enabled when `mtp_num_layers` is set to the physical layer count and `mtp_unroll_steps` is set to the logical unroll depth.
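A hypothetical illustration of that pairing (`MTPConfig` is a stand-in class; the real config object comes from the training runtime), mirroring the `getattr` fallback used in the diff:

```python
class MTPConfig:
    """Stand-in for the runtime config fields this PR touches."""
    def __init__(self, mtp_num_layers, mtp_unroll_steps=None):
        self.mtp_num_layers = mtp_num_layers      # physical MTP layer count
        self.mtp_unroll_steps = mtp_unroll_steps  # logical unroll depth (optional)

# Weight reuse: one physical layer unrolled for four logical steps.
config = MTPConfig(mtp_num_layers=1, mtp_unroll_steps=4)
unroll_steps = getattr(config, 'mtp_unroll_steps', None) or config.mtp_num_layers
print(unroll_steps)  # 4

# Without mtp_unroll_steps, behavior falls back to one step per physical layer.
legacy = MTPConfig(mtp_num_layers=1)
print(getattr(legacy, 'mtp_unroll_steps', None) or legacy.mtp_num_layers)  # 1
```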