[fsdp, megatron, trainer] fix: enhance mem footprint for forward_kl_topk OPD #6848
Dmitrii Choklia seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
Code Review
This pull request introduces a distillation_only flag to optimize memory usage and skip policy loss calculations during supervised top-k distillation. The changes span the PPO trainers, distillation loss calculations, and both FSDP and Megatron transformer engine implementations to conditionally omit log_probs computation. A potential issue was identified in the FSDP engine implementation (verl/workers/engine/fsdp/transformer_impl.py), where a destructive in-place modification of logits during log-probability calculation could corrupt the logits before they are processed by the distillation logits processor.
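The in-place hazard flagged above can be illustrated with a minimal sketch in plain Python. This is not the code from verl/workers/engine/fsdp/transformer_impl.py; the helpers `log_softmax_inplace` and `top_k` are hypothetical names chosen for the example, and lists stand in for tensors. It only shows why a consumer that expects raw logits must read them (or a copy) before an in-place log-probability step runs.

```python
import math


def log_softmax_inplace(logits):
    """Hypothetical helper: overwrite `logits` with log-probabilities in place."""
    m = max(logits)
    # Numerically stable log-sum-exp over the vocabulary axis.
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    for i in range(len(logits)):
        logits[i] -= lse  # destructive: the raw logits are gone after this
    return logits


def top_k(values, k):
    """Hypothetical stand-in for a distillation logits processor."""
    return sorted(values, reverse=True)[:k]


raw_logits = [2.0, 1.0, 0.0]

# Safe order: the processor reads the raw logits before anything mutates them.
safe_topk = top_k(raw_logits, 2)

# Hazardous order: the buffer is mutated first, so the processor sees
# log-probabilities instead of the logits it expects.
shared_buffer = list(raw_logits)
log_softmax_inplace(shared_buffer)
corrupted_topk = top_k(shared_buffer, 2)
```

Either reordering the two steps or computing log-probabilities out of place avoids the problem; which one the PR uses is not stated in this conversation.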
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
/gemini review |
Code Review
This pull request introduces a "distillation_only" mode to optimize memory footprint and computation when performing supervised top-k distillation without policy gradients or task rewards. It skips the calculation, gathering, and propagation of log_probs across FSDP and Megatron engines, and adds corresponding unit tests to verify this behavior. Additionally, it empties the PyTorch cache before the Megatron optimizer step when top-k distillation is active to mitigate potential OOM issues.
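A rough sketch of the engine-side behavior described in this summary, assuming a forward step that returns a dictionary of outputs. The function `forward_outputs` and the key names are hypothetical and are not taken from the FSDP or Megatron engines; the point is only that the log-probability callback is never invoked when `distillation_only` is set.

```python
import math
from typing import Callable, Dict, List


def full_vocab_log_probs(logits: List[float]) -> List[float]:
    """Out-of-place log-softmax over the whole vocabulary (the costly part)."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def forward_outputs(
    logits: List[float],
    compute_log_probs: Callable[[List[float]], List[float]],
    distillation_only: bool,
) -> Dict[str, List[float]]:
    """Hypothetical forward step: skip log_probs when only distillation is needed."""
    outputs = {"logits": logits}
    if not distillation_only:
        # Only the PPO/policy-loss path needs these values.
        outputs["log_probs"] = compute_log_probs(logits)
    return outputs


calls = []


def counting_log_probs(logits: List[float]) -> List[float]:
    calls.append(1)  # record that the expensive path actually ran
    return full_vocab_log_probs(logits)


skipped = forward_outputs([2.0, 1.0, 0.0], counting_log_probs, distillation_only=True)
calls_after_skip = len(calls)
computed = forward_outputs([2.0, 1.0, 0.0], counting_log_probs, distillation_only=False)
```

A unit test in this style can assert both that the `log_probs` key is absent and that the callback was not called, which is closer to verifying the memory saving than checking the key alone.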
Force-pushed from 7482c90 to e18691f.
|
/gemini review |
Code Review
This pull request introduces a distillation_only mode to skip policy loss computation and log probability calculation when performing supervised top-k distillation without task rewards or policy gradients. This optimization reduces memory footprint and prevents potential out-of-memory (OOM) errors, particularly on tight VRAM setups. Key changes include updating the FSDP and Megatron transformer engines to conditionally bypass log probability computation, clearing the PyTorch cache before optimizer steps in Megatron when top-k distillation is active, and adding corresponding unit tests to verify the new behavior. I have no feedback to provide as there are no review comments.
|
@wucong25 could you pls approve for CI to run ? |
|
@dimjava please fix CPU tests then feel free to @ me to re-trigger CI for you |
Force-pushed from 88908d6 to 1625f3d.
|
@Luosuu pls run CI once again |
What does this PR do?
Reduce VRAM and compute for OPD (forward_kl_topk + use_policy_gradient=False + use_task_rewards=False) by skipping redundant full-vocab log_probs and PPO-loss work when only the top-k distillation loss is needed.
Introduces a distillation_only flag (set in the trainer, consumed by actor engines) and scopes empty_cache() on Megatron to top-k distillation steps.
Target config: distillation.distillation_loss.loss_mode=forward_kl_topk with supervised distillation (use_policy_gradient=False, use_task_rewards=False).
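The condition described above can be sketched as a small predicate. This is an illustration under assumptions, not the trainer code from this PR: `DistillationSettings` and `is_distillation_only` are hypothetical names, and only the three option names (`loss_mode`, `use_policy_gradient`, `use_task_rewards`) come from the description.

```python
from dataclasses import dataclass


@dataclass
class DistillationSettings:
    # Hypothetical container; the field names mirror the options named in the PR text.
    loss_mode: str = "forward_kl_topk"
    use_policy_gradient: bool = False
    use_task_rewards: bool = False


def is_distillation_only(cfg: DistillationSettings) -> bool:
    """True when only the top-k distillation loss is needed.

    In that case full-vocab log_probs and PPO-loss work are redundant.
    """
    return (
        cfg.loss_mode == "forward_kl_topk"
        and not cfg.use_policy_gradient
        and not cfg.use_task_rewards
    )


target = DistillationSettings()
with_policy_gradient = DistillationSettings(use_policy_gradient=True)
```

Computing the flag once in the trainer and passing it to the actor engines keeps the engines from having to re-derive it from three separate options.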
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
Multiple modules are listed like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
For a breaking change, add [BREAKING] to the beginning of the title.
Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
pytest tests/workers/test_megatron_distillation_only_on_cpu.py -q
pytest tests/workers/test_distillation_topk_symmetry_on_cpu.py -q
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Run the pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If the PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.