[fully_async, reward, docs] fix: skip reward function call when use_task_rewards=False#6870
[fully_async, reward, docs] fix: skip reward function call when use_task_rewards=False#6870kenkenpa2126 wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a configuration option 'use_task_rewards' to conditionally skip reward computation ('_compute_score') during agent loop post-processing. This is useful when training with custom datasets that do not have a compatible reward function. The documentation in 'docs/algo/opd.md' has been updated to reflect this behavior. There are no review comments, so we have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
15180b4 to
f5a0c7e
Compare
When training with OPD (distillation.enabled=True) and distillation_loss.use_task_rewards=False, the agent_loop still called _compute_score unconditionally. If the dataset's data_source was not registered in default_compute_score, this raised NotImplementedError and crashed training before the first step. The use_task_rewards flag only suppressed the task reward contribution to the loss (in distillation/losses.py) but did not prevent the reward function from being invoked. OPD trains entirely on the distillation loss; calling the reward function is unnecessary and error-prone when using custom datasets. Fix: read use_task_rewards from distillation config in __init__ and skip _compute_score when it is False. Co-authored-by: Claude <noreply@anthropic.com>
…entirely When use_task_rewards=false, _compute_score is not invoked at all, not just zeroed in the loss. This matters for custom datasets whose data_source is not registered in default_compute_score. Co-authored-by: Claude <noreply@anthropic.com>
f5a0c7e to
1158676
Compare
|
@kenkenpa2126 Which trainer do you use? In new V1 trainer, |
…iner compat When use_task_rewards=False, _compute_score() is skipped and reward_score stays None, so rm_scores is never added to the output batch. V1 trainer's _compute_advantage unconditionally reads rm_scores from TransferQueue even in distillation-only setups, causing a KeyError. Fix by setting reward_score=0.0 in the else branch so rm_scores is always populated (as all-zeros at the last response token) regardless of trainer. Co-authored-by: Claude <noreply@anthropic.com>
…rewards=False Regression test for the fix in e242dae. Verifies that when use_task_rewards=False, _agent_loop_postprocess sets reward_score=0.0 (rather than leaving it None), so that _postprocess can populate rm_scores in the training batch — required by V1 trainer's _compute_advantage. Co-authored-by: Claude Signed-off-by: Kenta Izumi <kenta.izumi@bytedance.com>
|
@wuxibin89 You're right that |
What does this PR do?
Fixes a crash in OPD training when distillation.distillation_loss.use_task_rewards=False and the dataset's data_source is not registered in
default_compute_score.The
use_task_rewardsflag previously only suppressed the task-reward contribution to the loss (in distillation/losses.py) but did not prevent_compute_scorefrom being invoked inagent_loop.py. When using a custom dataset whosedata_sourceis not registered in the default reward registry, this raisedNotImplementedErrorand crashed training before the first step.OPD with
use_task_rewards=Falsetrains entirely on the distillation loss; invoking the reward function is unnecessary and error-prone with custom datasets.Checklist Before Starting
[{modules}] {type}: {description}(This will be checked by the CI){modules}includefsdp,megatron,veomni,sglang,vllm,rollout,trainer,ci,training_utils,recipe,hardware,deployment,ray,worker,single_controller,misc,perf,model,algo,env,tool,ckpt,doc,data,cfg,reward,fully_async,one_step_off,like[megatron, fsdp, doc]{type}is infeat,fix,refactor,chore,test[BREAKING]to the beginning of the title.[BREAKING][fsdp, megatron] feat: dynamic batchingTest
Verified under fully async OPD training with a custom QA dataset (data_source not registered in default_compute_score). Previously crashed at the first distillation step with
NotImplementedError; now runs without invoking the reward function whenuse_task_rewards=False.API and Usage Example
No API change. Existing config usage:
Design & Code Changes
_compute_scorecall with ifself.use_task_rewards.use_task_rewards=falseskips the reward function invocation entirely, not just the loss contribution.Checklist Before Submitting
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=alwaysci-requestchannel in theverlSlack workspace. (If not accessible, please try the Feishu group (飞书群).)recipesubmodule, please also update the reference to the submodule commit viagit submodule update --remoteorcd recipe && git pull origin main.