Reduce downstream Megatron patching for RL use cases #4590

@sbhavani

Description

Some Megatron Core features are difficult to use from external RL training loops without copying or monkey-patching GPTModel.forward, the GPT postprocess step, the MTP postprocess step, or the 1F1B schedule plan.

This usually happens when the training loop owns data or semantics that Megatron Core should not model directly: selected-token labels, loss masks, packed sequence metadata, old/reference logprobs, KL/entropy terms, or custom fused logprob/loss computation.

Current downstream symptoms

  • veRL patches/copies GPT/MTP postprocess logic:
    • verl/models/mcore/mtp_patch.py
    • verl/models/mcore/model_forward_fused.py
    • verl/models/mcore/model_forward_1f1b_overlap.py
  • NeMo RL has patch_gpt_model_forward_for_linear_ce_fusion(...) in nemo_rl/distributed/model_utils.py, which monkey-patches GPTModel.forward to return selected-token logprobs from hidden states and output weights; an unfused sketch of that computation follows this list.
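
For orientation, this is the quantity such patches compute: per-token logprobs of the selected tokens, taken directly from decoder hidden states and the (possibly tied) output embedding weight. This is an illustrative, unfused sketch, not the NeMo RL or veRL code; real patches use fused kernels precisely to avoid materializing the full-vocab logits tensor spelled out here.

```python
import torch

def selected_token_logprobs(hidden_states: torch.Tensor,
                            output_weight: torch.Tensor,
                            target_tokens: torch.Tensor) -> torch.Tensor:
    """Unfused reference: logprobs of selected tokens from hidden states.

    hidden_states: [seq, batch, hidden]   decoder output
    output_weight: [vocab, hidden]        (possibly tied) output embedding
    target_tokens: [seq, batch]           tokens whose logprobs are needed
    """
    # Fused implementations never materialize this [seq, batch, vocab]
    # tensor; it appears here only to make the computation explicit.
    logits = torch.einsum('sbh,vh->sbv', hidden_states, output_weight)
    logprobs = torch.log_softmax(logits.float(), dim=-1)
    return torch.gather(logprobs, -1, target_tokens.unsqueeze(-1)).squeeze(-1)
```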

This indicates that external training loops need a stable, objective-neutral extension point at the GPT postprocess boundary.

Proposed direction

Add a small, optional, keyword-only GPT output/postprocess hook. The hook should run after decoder hidden states are available and before the default output-layer logits/loss path, so that PPO/GRPO/RL-specific arguments never need to be added to GPTModel.forward.
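
A minimal sketch of the shape this could take, using a toy module in place of GPTModel. The hook name output_processor and its signature are illustrative only, not a committed API:

```python
from typing import Callable, Optional
import torch

class TinyGPT(torch.nn.Module):
    """Toy stand-in for GPTModel; illustrates hook placement only."""

    def __init__(self, vocab: int, hidden: int):
        super().__init__()
        self.decoder = torch.nn.Embedding(vocab, hidden)  # stands in for the transformer stack
        self.output_layer = torch.nn.Linear(hidden, vocab, bias=False)

    def forward(self, input_ids: torch.Tensor, *,
                labels: Optional[torch.Tensor] = None,
                output_processor: Optional[Callable] = None):
        hidden_states = self.decoder(input_ids)  # decoder path unchanged
        if output_processor is not None:
            # Runs after hidden states, before the default logits/loss path.
            # The RL loop binds labels, loss masks, packed-sequence metadata,
            # old/reference logprobs, etc. into the callable via a closure,
            # so forward never grows objective-specific arguments.
            return output_processor(hidden_states, self.output_layer.weight)
        logits = self.output_layer(hidden_states)  # default path unchanged
        if labels is None:
            return logits
        return torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1))
```

A training loop would bind its own tensors into the callable (e.g. with functools.partial), keeping all objective semantics downstream while Megatron Core stays objective-neutral.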

Schedule-plan support

Thread the same optional processor/context through build_schedule_plan and the 1F1B schedule-plan PostProcessNode.
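
A sketch of the threading idea; the real build_schedule_plan and PostProcessNode have different shapes, so treat these names and signatures as placeholders:

```python
from typing import Callable, Optional
import torch

class PostProcessNode:
    """Placeholder for the 1F1B schedule-plan postprocess node."""

    def __init__(self, output_weight: torch.Tensor,
                 output_processor: Optional[Callable] = None):
        self.output_weight = output_weight
        self.output_processor = output_processor  # same hook, threaded through the plan

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.output_processor is not None:
            # Identical contract to the GPTModel.forward hook, so fused
            # logprob/loss logic also works under the overlapped schedule.
            return self.output_processor(hidden_states, self.output_weight)
        return hidden_states @ self.output_weight.t()  # default logits path
```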

MTP follow-up

Handle MTP separately if needed. First investigate whether MTP can expose a narrow callable for custom loss/logprob computation while Megatron Core continues to own MTP shifting, packed-sequence handling, scaling, and logging behavior.
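
If that investigation pans out, the contract could be as narrow as a per-position loss/logprob callable. The names below are hypothetical, intended only to show how small the delegated surface could be:

```python
from typing import Callable
import torch

# Hypothetical contract: Megatron Core keeps token shifting, packed-sequence
# handling, loss scaling, and logging; only this computation is delegated.
MTPLossFn = Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor]
# (mtp_logits [seq, batch, vocab], shifted_labels [seq, batch],
#  loss_mask [seq, batch]) -> per-token loss [seq, batch]

def default_mtp_loss(logits: torch.Tensor, labels: torch.Tensor,
                     loss_mask: torch.Tensor) -> torch.Tensor:
    """Reference behavior a custom RL callable would replace."""
    per_token = torch.nn.functional.cross_entropy(
        logits.permute(1, 2, 0),   # [batch, vocab, seq]
        labels.t(),                # [batch, seq]
        reduction='none').t()      # back to [seq, batch]
    return per_token * loss_mask
```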
