feature(zsa): add a minimal general ORM RL example on Geo3K #56
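Here is a full training-validation report based on the current PR code. The run used is:
1. Run configuration for this validation
This time the run used a real
2. Training results overview
This run completed in full and finally reached:
Judging from the wandb results, the overall trend is normal. The main conclusions are:
Under the current reward recipe, the reward structure of this run is clear:
So the rise in total reward this time is mainly
3. Additional notes on the optimization signal
4. Key curves
Summary Card
Reward Dashboard
Optimization Dashboard
5. Current conclusion
Based on this full run, my assessment of the current PR code is:
If needed later, I can add another version: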
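Adding a full experiment report for the real long training run (full SGLang + RM engine) on this PR's current code. The report format follows PR54: #54
1. The real run behind this experiment
2. Key launch configuration
This was not a local smoke test but an actual
In addition, inside the worker, before launching training, this run explicitly added
3. Core results and conclusions
My assessment of this run is:
4. Figures
Summary Card
Reward Dashboard
Optimization Dashboard
5. Three sample groups drawn from real trajectories
The three groups below are not hand-written illustrations; they were sampled directly from the trajectory files of this real run, and cover: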
Case A: Final-step correct sample
Case B: Partial reward from RM support
Case C: Format-only failure case
If needed later, I can add a rule-only baseline comparison based on this full run, or take this run's
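A correction note on how the real question cases are presented.
The samples in the previous experiment-report comment were chosen by "different reward shapes", but a more direct presentation is to fix the same question and compare how the inference and the reward change between step80 and step320.
I re-screened the real trajectories of this full run,
All 4 images come directly from real trajectories:
Question A: parallelogram area
The change here is typical: at step80 the answer was already "very close to correct", so
Question B: tangent geometry y
This question changes more dramatically: at step80 the response basically only kept the format, and neither accuracy nor the general RM awarded any score; at step320 the response became a complete, correct solution, so
In addition, I updated the READMEs accordingly:
All images in these two READMEs now use repo-relative paths, with the assets placed under:
This way the documentation itself does not depend on external attachment links.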
Geo3K-specific reward mixing logic that combines format, general-model, and
accuracy rewards during training and evaluation.
"""
from __future__ import annotations
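Some of this already exists in https://github.com/opendilab/LightRFT/blob/main/examples/gsm8k_geo3k/reward_models_utils.py. Please import those parts directly from gsm8k_geo3k/reward_models_utils.py and keep only what is newly added for orm_rl_demo in this file.
This involves code refactoring, so it is skipped in this PR for now and will be handled separately later.
A detailed account of the current state and the assessment:
How the overlapping functions actually compare
The pure rule functions extract_response, geo3k_accuracy_reward_fn, geo3k_format_reward_fn, gsm8k_accuracy_reward_fn, and gsm8k_format_reward_fn are indeed implemented almost identically, differing only slightly in docstring wording.
But the three core functions mix_rewards, reward_fn, and load_reward_models differ substantially between the two sides: the gsm8k_geo3k version is a pure rule path (model_reward_list is always empty), while the orm_rl_demo version has to fuse scores from a real neural RM, so the two cannot be swapped directly.
What it would actually take
Because the scripts under examples/ are not a Python package (there is no __init__.py), a cross-directory import can only be done with sys.path.insert, which introduces an implicit dependency between the two examples. That route is not clean to begin with.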
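As a rough sketch of what that cross-directory import would involve: the helper name `add_sibling_example_to_path` and the assumed layout `examples/<example>/<file>` are illustrative only and are not code from this PR.

```python
import sys
from pathlib import Path


def add_sibling_example_to_path(current_file: str, sibling: str = "gsm8k_geo3k") -> Path:
    """Put a sibling example directory on sys.path (hypothetical helper).

    Assumes the layout examples/<this_example>/<current_file>, with the
    sibling example living next to it under examples/.
    """
    sibling_dir = Path(current_file).resolve().parent.parent / sibling
    if str(sibling_dir) not in sys.path:
        # Any later import from the sibling only works because of this
        # process-wide side effect: the implicit dependency in question.
        sys.path.insert(0, str(sibling_dir))
    return sibling_dir


# Inside the orm_rl_demo example one would then write something like:
#   add_sibling_example_to_path(__file__)
#   from reward_models_utils import geo3k_accuracy_reward_fn
# If both examples name their file reward_models_utils.py, the bare module
# name is ambiguous, which is one more reason this route is fragile.
```

Nothing in the import statement itself reveals that another example directory must exist on disk, which is why the dependency is described as implicit.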
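The genuinely clean approach is to move the shared pure utility functions (the accuracy/format reward functions, extract_response, and so on) into a submodule of the lightrft/ package (for example lightrft/utils/reward_utils.py) and have both sides import from lightrft. But that means changing the lightrft package itself, adding tests, and adjusting the imports in both examples, which is a larger scope than touching only the example.
Current cost and risk assessment
The two examples are currently designed to be relatively independent. orm_rl_demo's reward_models_utils.py contains demo-specific logic such as RM loading, engine configuration, and reward fusion, and it is not functionally symmetric with the rule-only version in gsm8k_geo3k. Forcing a merge could make both sides harder to maintain.
I suggest keeping the status quo here. If the reward utility architecture is to be unified later, the better way is to open a separate issue to design it (which functions should move into the lightrft/ package, and how the interface should look), rather than bundling a riskier change into this PR.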
from reward_models_utils import load_reward_models, reward_fn, RECIPE


def _apply_label_override(dataset, label_key: str, label_override: str, strategy, dataset_name: str):
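Apart from _apply_label_override, could the def train(args): below reuse the one at https://github.com/opendilab/LightRFT/blob/main/examples/gsm8k_geo3k/train_colocate.py#L68? Please look at how to keep the code concise and extensible.
This involves code refactoring (reusing gsm8k_geo3k's train function), so it is skipped in this PR for now and will be handled separately later.
A detailed account of the current state and the assessment:
Where the two train() functions actually differ
orm_rl_demo's train() and gsm8k_geo3k's train() share the same overall structure. There are only two core differences:
- After the prompt dataset and the eval dataset are loaded, each gets one _apply_label_override call. This is orm_rl_demo-specific logic (at runtime it overrides the geo3k label with geo3k_general so that samples take the general ORM reward fusion path); gsm8k_geo3k has no such need.
- Some differences in the details of the critic FSDP branch and of logging (the gsm8k version additionally has torch.multiprocessing.set_sharing_strategy, among others).
What reuse would actually take
The most direct option is to give gsm8k_geo3k's train() an optional dataset_transform: Optional[Callable] = None parameter, with orm_rl_demo passing in a callable that wraps _apply_label_override. But that pushes an ORM-specific hook requirement into the simpler example, which is the wrong direction.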
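To make the hook idea concrete, here is a minimal sketch under stated assumptions: `load_datasets` stands in for the dataset-loading part of train(), and `override_labels` is a simplified stand-in for _apply_label_override operating on a plain list of dicts. Neither matches the real signatures in the repository.

```python
from typing import Any, Callable, Dict, List, Optional

Sample = Dict[str, Any]
DatasetTransform = Callable[[List[Sample]], List[Sample]]


def override_labels(samples: List[Sample], label_key: str, label_override: str) -> List[Sample]:
    """Simplified stand-in for _apply_label_override: rewrite one field per sample."""
    return [{**sample, label_key: label_override} for sample in samples]


def load_datasets(
    raw_samples: List[Sample],
    dataset_transform: Optional[DatasetTransform] = None,
) -> List[Sample]:
    """Hypothetical stand-in for the dataset-loading step inside train()."""
    samples = list(raw_samples)
    if dataset_transform is not None:
        # The only ORM-specific behavior enters through this optional hook.
        samples = dataset_transform(samples)
    return samples


raw = [{"prompt": "Find x.", "label": "geo3k"}]
# Rule-only caller: no hook, labels stay untouched.
plain = load_datasets(raw)
# ORM caller: route samples to the general ORM path by relabeling them.
general = load_datasets(
    raw,
    dataset_transform=lambda samples: override_labels(samples, "label", "geo3k_general"),
)
```

The rule-only caller never has to know the hook exists, but the shared function still carries a parameter that only one example uses, which is the trade-off the comment points at.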
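The other route is to extract the common train() into an extensible base class in the lightrft/ package, with each example subclassing it and overriding the data-preprocessing step. This is a reasonable long-term direction, but it is a framework-level change whose cost and review risk both exceed those of this PR.
Current recommendation
orm_rl_demo's train() has already completed a full 2-GPU training run, so keeping the independent copy is the lowest-risk option for now. If the examples' train() architecture is to be unified later (with support for a dataset transform hook), the better way is to open a separate issue to design it, rather than bundling a wider change into this PR.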
…x format
- Rename assets/verified_full_run_20260417 -> assets/exp_20260417
- Rewrite README_zh.md: user-facing tone, merge overview sections, restructure experiment results into 实验设置/整体曲线结果/案例分析 (experiment setup / overall curve results / case analysis), clarify reward formula (general_model coefficient 0.2 vs raw ORM output range)
- Rewrite README.md with equivalent English changes
- Convert all reward_models.py docstrings from Google Args:/Returns: style to Sphinx :param/:type/:return/:rtype format to match repo convention
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ics to [0, 1]
Previously, the logged metrics were weighted contributions rather than
raw scores:
- general_model_reward was logged as 0.2 × ORM_score, capping at 0.2
- rule_reward was logged as 0.1×format + 0.7×accuracy, ranging [0, 0.8]
The final_reward computation (used for training) is unchanged:
final = 0.1×fmt + 0.2×orm_score + 0.7×acc
Only the metrics dict values are corrected:
- general_model_reward now logs the raw ORM output {0.0, 0.5, 1.0}
- rule_reward now logs (0.1×fmt + 0.7×acc) / 0.8, normalizing to [0, 1]
while preserving the relative weighting between format and accuracy
Verified with a 2-GPU run: metrics now fall in [0, 1] and final_reward
values are unchanged from previous runs.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
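The arithmetic in this commit message can be sketched as follows. The weights and metric names come from the message above; the function name `mix_rewards_sketch` and its signature are assumptions for illustration and are not the repository's mix_rewards.

```python
from typing import Dict, Tuple

# Weights as stated in the commit message.
FORMAT_W, ORM_W, ACC_W = 0.1, 0.2, 0.7


def mix_rewards_sketch(fmt: float, orm_score: float, acc: float) -> Tuple[float, Dict[str, float]]:
    """Return (final_reward, logged_metrics) for one sample (illustrative only)."""
    # Training signal: unchanged weighted sum of the three components.
    final_reward = FORMAT_W * fmt + ORM_W * orm_score + ACC_W * acc
    metrics = {
        # Logged as the raw ORM output rather than 0.2 * orm_score.
        "general_model_reward": orm_score,
        # Rule part divided by its total weight (0.8) so it spans [0, 1]
        # while format and accuracy keep their 1:7 relative weighting.
        "rule_reward": (FORMAT_W * fmt + ACC_W * acc) / (FORMAT_W + ACC_W),
    }
    return final_reward, metrics


# Correct format, half ORM credit, wrong final answer.
final_reward, metrics = mix_rewards_sketch(fmt=1.0, orm_score=0.5, acc=0.0)
```

For that sample the final reward is about 0.2, while the logged rule_reward is about 0.125 instead of the 0.1 the old weighted logging would have shown.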
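Fix: previously these two metrics logged weighted contributions rather than raw scores, which made the value ranges small and unintuitive:
The training computation is completely unaffected; no change was made to it, only
Verification results (real 2-GPU run):
commit: cb2c20e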
    num_per_rank = len(texts) // model._tp_size
    texts = texts[model._tp_rank * num_per_rank : (model._tp_rank + 1) * num_per_rank]
else:
    from vllm import SamplingParams
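The lazy import here is intentional. vllm is not necessarily installed in every runtime environment (an sglang-only environment has no vllm), so placing the import at the top of the file would raise ImportError in such environments and make the whole module unimportable. The current approach executes from vllm import SamplingParams only when the vllm branch is actually taken, which avoids a hard dependency. If needed later, a top-level try: from vllm import SamplingParams\nexcept ImportError: SamplingParams = None could be considered, but that introduces extra None checks; on balance, keeping the lazy import is cleaner for now.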
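A minimal sketch of the lazy-import pattern described above. The helper name `build_sampling_params` and the choice to return a plain dict on the sglang branch are assumptions for illustration; the code under review is structured differently.

```python
from typing import Any


def build_sampling_params(engine_type: str, **kwargs: Any) -> Any:
    """Return engine-specific sampling parameters (hypothetical helper)."""
    if engine_type == "sglang":
        # Assumption for this sketch: the sglang branch takes a plain dict,
        # so nothing from vllm is needed here.
        return dict(kwargs)
    if engine_type == "vllm":
        # Lazy import: executed only when the vllm branch is taken, so an
        # sglang-only environment can still import this module.
        from vllm import SamplingParams

        return SamplingParams(**kwargs)
    raise ValueError(f"unknown engine_type: {engine_type!r}")


# Works even when vllm is not installed, because the import never runs.
params = build_sampling_params("sglang", temperature=0.7, max_new_tokens=64)
```

With the top-level try/except alternative, every use of SamplingParams would need a guard against None; the function-level import confines the failure to the one branch that actually requires vllm.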
- README: fix "reward engine" -> "SGLang" in engine description (FSDP and SGLang are now consistent concepts on the same level)
- README: drop redundant "template / no hardcoded paths" sentence; the /path/to/ placeholders in the code block already make this obvious
- train_colocate.py: remove verbose banner comment block above init_model_context; the short "# configure model" line is sufficient
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Brings in:
- examples/orm_rl_demo: a new general ORM RL example on Geo3K (PR opendilab#56)
- minor stat-aggregation changes in spmd_ppo_trainer / ppo_trainer_vl
- strategy_base reward-model shard_size = world_size
Conflict resolution (3 files):
- lightrft/strategy/strategy_base.py: kept HEAD's _resolve_fsdp_shard_size helper (more robust for non-divisible world sizes).
- lightrft/trainer/ppo_trainer_vl.py: kept HEAD's generic reward_metric_values = defaultdict(list) collection (subsumes the named-list approach upstream introduced, and is required for the PRM variant 2 diagnostics).
- lightrft/trainer/spmd_ppo_trainer.py: kept HEAD's compact rollout-stats aggregation; upstream's elaborate named-block was redundant with the later "Detailed Step Statistics" section HEAD already has.
- lightrft/trainer/fast_exp_maker.py: removed duplicate references= kwarg auto-merge fluke.
Verification:
- python3 ast.parse on all 4 conflict files → clean
- python3 -m unittest examples.math_prm.tests.test_ursa_variant2 -v → Ran 9 tests in 0.050s, OK
…sults
Resolves all 9 C-severity and 5 I-severity findings from opendilab#53 (comment)
C — Critical (blocking) — fixed:
- Delete 7 debug-only smoke / fix-verify scripts that hardcoded /home/ubuntu, /mnt/.../puyuan, and /home/ubuntu/miniconda3/.../torchrun:
    run_grpo_smoke_misalign_fix.sh
    run_smoke_base_eval_only.sh
    run_smoke_eval_fix_verify.sh
    run_smoke_padding_fix_verify.sh
    run_smoke_per_step_prm.sh
    run_smoke_per_step_prm_groupnorm.sh
    run_smoke_paper_variant2.sh
  Only the two production launchers ship now (PS-GRPO + variant 2).
- tools/prepare_ursa_stage3_manifest.py: drop /home/ubuntu/... defaults for --input-path / --image-root; both are now required=True so a fresh user gets a clear missing-arg error instead of silently targeting someone else's home directory.
- run_grpo_math_prm_ursa_8b_variant2.sh + run_grpo_math_prm_ursa_8b.sh: add `set -eo pipefail` at the top so a crashed torchrun propagates its exit code through the `2>&1 | tee` pipeline (previously, tee's success masked torchrun crashes and orchestrators saw a green run).
I — Important (blocking) — fixed:
- examples/math_prm/assets/exp_20260603/{eval_outcome,kl_and_rollout,eval_quality,variant2_health}.png: 4 W&B-derived figures from the 9-day production run, matching the orm_rl_demo/assets/exp_*/ pattern established in PR opendilab#56.
- README.md / README_zh.md: add §7 "Results — 9-day production run" section with eval-outcome table, W&B run link, and the 4 figures.
- README.md / README_zh.md: add §6 "Strict Paper Eq.9 — variant 2 path" section (formula, math_per_step_prm workflow, sed-relabel command, unit-test invocation); previously the variant-2 launcher shipped without any README coverage.
- README.md / README_zh.md: update §8 files tree to match git ls-files (adds ursa_variant2.py, test_ursa_variant2.py, assets/; removes the 7 deleted smoke scripts).
- README.md / README_zh.md: also add Available labels row for math_per_step_prm; expand the "What's Logged" section to include the 13 PRM diagnostic fields + the 7 variant-2 ursa_v2_* fields.
- run_grpo_math_prm_ursa_8b.sh:236: switch from `> "${TRAIN_LOG}" 2>&1` to `2>&1 | tee "${TRAIN_LOG}"` so tmux operators see live training output (matches orm_rl_demo / r1_aqa / gsm8k_geo3k launcher convention).
- test_ursa_variant2.py: move from examples/math_prm/tests/ to examples/math_prm/ top level to match every other example in the repo. Path-resolution fixed accordingly.
M — Minor (non-blocking) — also addressed:
- run_grpo_math_prm_ursa_8b_variant2.sh header docstring rewritten to describe variant 2 (was previously a verbatim PS-GRPO copy/paste).
- train_colocate.py:28 docstring "usage: python train_grpo_rm_colocate.py" corrected to "python examples/math_prm/train_colocate.py".
Verification:
$ python3 -m unittest examples.math_prm.test_ursa_variant2 -v
Ran 9 tests in 0.034s — OK
Summary
This draft PR adds a new orm_rl_demo example / experiment for minimal ORM-based RL training in LightRFT.
The target example is a single runnable path similar to:
The intended experiment is:
This PR remains draft because final follow-up review is still pending.
Type of Change
Related Issues
What This Draft Adds
This draft is intended to add one focused example / experiment rather than a broader multi-purpose training surface.
The intended end state is:
examples/orm_rl_demo
Testing
Environment:
Command(s):
Results:
Integration Notes
This draft branch has already been synced with the latest upstream main.
While finishing the example cleanup, the main runtime areas worth paying attention to are still:
- experience_maker_vl and trajectory_saver
- broadcast_utils
- fast_exp_maker and trainer/utils.py
- gsm8k_geo3k rule-reward path
TODO Before Marking This Ready
- orm_rl_demo example / experiment.
- main.
- orm_rl_demo naming.
- gsm8k_geo3k.
Checklist
- main has been merged into this branch.
- orm_rl_demo.