Some Megatron Core features are difficult to use from external RL training loops without copying or monkey-patching GPTModel.forward, GPT postprocess, MTP postprocess, or 1F1B schedule plan.
This usually happens when the training loop owns data or semantics that Megatron Core should not model directly: selected-token labels, loss masks, packed-sequence metadata, old/reference logprobs, KL/entropy terms, or custom fused logprob/loss computation.
Current downstream symptoms
- veRL patches/copies GPT/MTP postprocess logic:
  - verl/models/mcore/mtp_patch.py
  - verl/models/mcore/model_forward_fused.py
  - verl/models/mcore/model_forward_1f1b_overlap.py
- NeMo RL has patch_gpt_model_forward_for_linear_ce_fusion(...) in nemo_rl/distributed/model_utils.py, which monkey-patches GPTModel.forward to return selected-token logprobs from hidden states and output weights.
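To make concrete what these patches reimplement, here is an unfused, pure-Python sketch of selected-token logprob computation from hidden states and (possibly tied) output-layer weights. The function name and list-based shapes are illustrative only, not Megatron Core or NeMo RL API:

```python
import math
from typing import List

def selected_token_logprobs(
    hidden_states: List[List[float]],   # [seq_len, hidden]
    output_weights: List[List[float]],  # [vocab, hidden]
    labels: List[int],                  # [seq_len] selected token ids
) -> List[float]:
    """Log-probability of each label token, computed directly from
    decoder hidden states and the output-layer weight matrix."""
    logprobs = []
    for h, label in zip(hidden_states, labels):
        # logits[v] = <h, W[v]> -- the output-layer matmul, one row at a time
        logits = [sum(hi * wi for hi, wi in zip(h, w)) for w in output_weights]
        # numerically stabilized log-softmax at the label index
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        logprobs.append(logits[label] - log_z)
    return logprobs
```

Fused implementations avoid materializing the full-vocabulary logits, which is exactly why downstream loops patch the postprocess path rather than recomputing from returned logits.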
This indicates that external training loops need a stable, objective-neutral extension point at the GPT postprocess boundary.
Proposed direction
Add a small, optional, keyword-only GPT output/postprocess hook. The hook would run after decoder hidden states are available and before the default output-layer logits/loss path, which avoids adding PPO/GRPO/RL-specific arguments to GPTModel.forward.
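A minimal sketch of the hook's shape, using a stand-in model class. The names `output_processor` and `output_processor_context` are placeholders for whatever the final API chooses; only the control flow (hook after hidden states, before the default head, keyword-only) reflects the proposal:

```python
from typing import Any, Callable, Optional

# Hypothetical hook signature: receives decoder hidden states plus an
# opaque caller-owned context, returns whatever the training loop needs.
OutputProcessor = Callable[[Any, Any], Any]

class GPTModelSketch:
    """Minimal stand-in for GPTModel illustrating the keyword-only hook."""

    def __init__(self, decoder, output_layer):
        self.decoder = decoder
        self.output_layer = output_layer

    def forward(
        self,
        input_ids,
        *,
        output_processor: Optional[OutputProcessor] = None,
        output_processor_context: Any = None,
    ):
        hidden = self.decoder(input_ids)
        if output_processor is not None:
            # Hook path: runs after hidden states are available and
            # before the default output-layer logits/loss path.
            return output_processor(hidden, output_processor_context)
        # Default path is unchanged when no hook is supplied.
        return self.output_layer(hidden)
```

Because the hook is keyword-only with a `None` default, existing callers and checkpointed configs are unaffected, and RL-specific semantics (labels, masks, KL terms) stay inside the caller-owned context object.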
Schedule-plan support
Thread the same optional processor/context through build_schedule_plan and the 1f1b schedule-plan PostProcessNode.
MTP follow-up
Handle MTP separately if needed. First investigate whether MTP can expose a narrow callable for custom loss/logprob computation while Megatron Core continues to own MTP shifting, packed-sequence handling, scaling, and logging behavior.
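One possible shape for such a narrow callable, with Megatron Core keeping ownership of the surrounding masking and scaling. All names here are hypothetical, and the default path is stubbed; shifting and packed-sequence handling would happen before this point, inside Megatron Core:

```python
from typing import Callable, List, Optional

# Caller-supplied callable: (logits, labels, loss_mask) -> per-token losses.
MTPLossFn = Callable[[List[float], List[float], List[int]], List[float]]

def default_mtp_loss(logits, labels, loss_mask):
    # Stand-in for the existing MTP cross-entropy path (not reproduced here).
    raise NotImplementedError("placeholder for existing MTP loss")

def mtp_postprocess(
    logits,
    labels,
    loss_mask,
    *,
    loss_fn: Optional[MTPLossFn] = None,
    scale: float = 1.0,
):
    """Megatron-owned wrapper: applies the caller's loss_fn if given,
    then the masking/scaling it already owns."""
    if loss_fn is None:
        loss_fn = default_mtp_loss  # existing behavior unchanged
    per_token = loss_fn(logits, labels, loss_mask)
    masked = [l * m for l, m in zip(per_token, loss_mask)]
    denom = max(sum(loss_mask), 1)
    return scale * sum(masked) / denom
```

Under this split, a custom fused logprob/loss kernel replaces only `loss_fn`, while MTP shifting, packed-sequence handling, scaling, and logging stay inside Megatron Core, so patches like verl's mtp_patch.py would no longer need to copy the whole postprocess.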