
feature(zsa): add a minimal general ORM RL example on Geo3K#56

Merged
puyuan1996 merged 21 commits into opendilab:main from HansBug:dev/st on Apr 29, 2026
HansBug commented Apr 9, 2026

Member

Summary

This draft PR adds a new orm_rl_demo example / experiment for minimal ORM-based RL training in LightRFT.

The target example is a single runnable path similar to:

bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh

The intended experiment is:

  • dataset: Geo3K
  • base actor: Qwen2.5-VL-7B
  • outcome reward model: Qwen2.5-VL-72B general ORM
  • goal: clarify the core ORM RL workflow with a small end-to-end setup that is easier to understand, easier to run, and easier to debug

This PR remains draft because final follow-up review is still pending.

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • 🎨 Refactoring (code style, formatting, local variables)
  • Performance (improvements to code performance)
  • Testing (adding or fixing tests)
  • 📚 Documentation (updates to documentation)
  • 💥 Breaking change (fix or feature that causes existing functionality to fail)

Related Issues

  • Fixes #
  • Related to #

What This Draft Adds

This draft is intended to add one focused example / experiment rather than a broader multi-purpose training surface.

The intended end state is:

  • one minimal example path under examples/orm_rl_demo
  • one Geo3K-oriented general ORM RL entrypoint
  • one base Qwen2.5-VL-7B actor for trajectory generation
  • one Qwen2.5-VL-72B general outcome reward model for trajectory scoring
  • one smaller review surface for understanding the ORM RL loop clearly

Testing

Environment:

  • Python: local runtime validation completed for the narrowed demo
  • PyTorch: local runtime validation completed for the narrowed demo
  • CUDA: local runtime validation completed for the narrowed demo

Command(s):

bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh

Results:

  • Tests passed locally

Integration Notes

This draft branch has already been synced with the latest upstream main.

While finishing the example cleanup, the main runtime areas worth paying attention to are still:

  • rollout metadata flow through experience_maker_vl and trajectory_saver
  • FSDP / engine weight synchronization in broadcast_utils
  • multimodal generation flow in fast_exp_maker and trainer/utils.py
  • reward-side response extraction / prompt-contamination risks already addressed upstream in the gsm8k_geo3k rule-reward path

TODO Before Marking This Ready

  • Define the experiment scope around Geo3K + base Qwen2.5-VL-7B actor + one Qwen2.5-VL-72B general ORM.
  • Narrow the intended PR direction to one minimal orm_rl_demo example / experiment.
  • Collect the initial draft materials for this example / experiment in the branch.
  • Merge or rebase from the latest upstream main.
  • Replace remaining old naming with orm_rl_demo naming.
  • Trim the branch down to one clear example path and one primary run entry.
  • Remove extra scripts and descriptions outside the experiment scope.
  • Validate that the basic ORM RL workflow runs end-to-end.
  • Check that reward behavior is reasonably convergent.
  • Review and trim any remaining non-demo assumptions that show up during validation.
  • Confirm whether the demo reward path needs the same assistant-only response extraction hardening already added upstream for gsm8k_geo3k.
  • Polish documentation and command examples after the scope cleanup.

Checklist

  • The PR is now scoped as one minimal Geo3K general ORM RL example / experiment.
  • The intended setup is a base Qwen2.5-VL-7B actor plus one Qwen2.5-VL-72B general outcome reward model.
  • The latest upstream main has been merged into this branch.
  • Initial draft materials for the example / experiment have been collected into this draft branch.
  • Remaining naming has been unified to orm_rl_demo.
  • The example has been reduced to one clear primary run entry.
  • Extra scripts and descriptions have been removed.
  • The basic workflow has been validated end-to-end.
  • Reward behavior has been checked for reasonable convergence.
  • This PR is ready for review.

@HansBug HansBug self-assigned this Apr 9, 2026
@HansBug HansBug added the enhancement New feature or request label Apr 9, 2026
@HansBug HansBug requested a review from puyuan1996 April 9, 2026 06:14
@HansBug HansBug added the documentation Improvements or additions to documentation label Apr 9, 2026
@HansBug HansBug changed the title from "feature(safework): migrate svkng pipeline from cluster into runnable example" to "feature(orm_rl_demo): narrow scope to a minimal general ORM RL demo" Apr 9, 2026
@HansBug HansBug changed the title from "feature(orm_rl_demo): narrow scope to a minimal general ORM RL demo" to "feature(safework): migrate svkng pipeline from cluster into runnable example" Apr 9, 2026
@HansBug HansBug changed the title from "feature(safework): migrate svkng pipeline from cluster into runnable example" to "feature(orm_rl_demo): add a minimal general ORM RL example on Geo3K" Apr 9, 2026
Comment thread examples/orm_rl_demo/test_reward_models.py Outdated
Comment thread examples/orm_rl_demo/test_reward_models.py Outdated
Comment thread examples/orm_rl_demo/test_reward_models.py Outdated
Comment thread lightrft/strategy/vllm_utils/vllm_worker_wrap_no_ray.py
Comment thread examples/orm_rl_demo/reward_models_utils.py Outdated
Comment thread examples/orm_rl_demo/reward_models.py Outdated
Comment thread examples/orm_rl_demo/train_colocate.py Outdated
@puyuan1996 puyuan1996 changed the title from "feature(orm_rl_demo): add a minimal general ORM RL example on Geo3K" to "feature(zsa): add a minimal general ORM RL example on Geo3K" Apr 14, 2026
Comment thread lightrft/strategy/strategy_base.py
Comment thread lightrft/trainer/fast_exp_maker.py
Comment thread examples/orm_rl_demo/train_colocate.py Outdated
Comment thread examples/orm_rl_demo/train_colocate.py Outdated
Comment thread examples/orm_rl_demo/run_general_fsdp_qwenvl.sh Outdated
Comment thread examples/orm_rl_demo/run_general_fsdp_qwenvl.sh Outdated
Comment thread examples/orm_rl_demo/run_general_fsdp_qwenvl.sh Outdated
Comment thread examples/orm_rl_demo/README_zh.md Outdated
@puyuan1996 puyuan1996 marked this pull request as ready for review April 15, 2026 02:56
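HansBug commented Apr 17, 2026

Member Author

Here is a full training-validation report based on the current PR code. The run used is:

1. Run configuration for this validation

This was a complete long-training validation on 2 GPUs, launched with real rlaunch under the current evaluation image. The core configuration: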

  • Resources: 2 GPU / 40 CPU / 500000 memory
  • Image: registry.h.pjlab.org.cn/ailab-rlinfra-rlinfra_gpu/easyr1:lightrft-20260119
  • Data: /mnt/shared-storage-user/puyuan/data/geo3k
  • actor: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
  • general RM: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
  • rollout engine: vllm
  • RM inference: rm_use_engine=True, with vllm as the backend
  • label override: geo3k_general
  • reward mixing: format/general_model/accuracy = 0.1 / 0.2 / 0.7
  • num_episodes=20
  • train_batch_size=128
  • rollout_batch_size=128
  • micro_train_batch_size=4
  • micro_rollout_batch_size=4
  • n_samples_per_prompt=8
  • prompt_max_len=1024
  • generate_max_len=2048
  • actor_learning_rate=1e-6
  • init_kl_coef=0.001
  • lr_warmup_ratio=0.03
  • eval_steps=20
  • max_eval_samples=700
  • zero_stage=3
  • bf16=True
  • gradient_checkpointing=True
  • freeze_prefix=True
  • adam_offload=True
  • flash_attn=True
  • save_trajectories=True
  • max_ckpt_num=1

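2. Training results overview

This run completed in full and finally reached:

  • train/global_step = 320
  • eval was triggered 16 times, at train_step = 20, 40, ..., 320

The wandb results show a normal overall trend. The main conclusions:

  • eval/reward_mean rose from 0.4587 to 0.5736, an absolute gain of +0.1149 and a relative gain of about +25.0%
  • the best eval/reward_mean, 0.5793, occurred at train_step=240
  • the final value is only 0.0057 below the best, which indicates a plateau in the second half rather than a clear collapse
  • eval/accuracy_reward_mean rose from 0.3936 to 0.5225 and is the main source of the improvement
  • eval/format_reward_mean was around 0.99 from the start and stayed stable afterwards, so the format constraint converged early
  • eval/general_model_reward_mean rose from 0.0842 to 0.1086, a steady positive gain rather than 0
  • train/general_model_reward_mean rose from 0.0600 to 0.1488
  • train/accuracy_reward_mean rose from 0.2734 to 0.7168
  • train/format_reward_mean rose from 0.5313 to 0.9844
  • train/response_length_mean fell from 333.5 to 276.5; eval/response_length_mean finally stabilized around 268.8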
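The headline numbers in this list can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check on the reported values; the helper name `improvement` is illustrative and not part of the PR.

```python
def improvement(start: float, end: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain in percent) between two metric values."""
    absolute = end - start
    relative_pct = absolute / start * 100.0
    return absolute, relative_pct


# eval/reward_mean at the first and last evaluation, as reported above.
abs_gain, rel_gain = improvement(0.4587, 0.5736)
print(f"eval/reward_mean: +{abs_gain:.4f} absolute, +{rel_gain:.1f}% relative")

# Gap between the best eval (train_step=240) and the final eval.
gap_to_best = 0.5793 - 0.5736
print(f"final is {gap_to_best:.4f} below best")

# With eval_steps=20 and 320 global steps, eval fires at 20, 40, ..., 320.
eval_steps = list(range(20, 320 + 1, 20))
print(len(eval_steps), "evals")
```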

Under the current reward recipe, the reward structure of this run is clear:

  • rule_reward_mean = 0.1 * format_reward + 0.7 * accuracy_reward
  • step_reward_mean = rule_reward_mean + 0.2 * general_model_reward

So the rise in total reward is driven mainly by accuracy_reward; general_model_reward adds a further positive gain, and format_reward mostly holds steady at a high level.
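A minimal sketch of the recipe above, assuming all three component scores lie in [0, 1]. The names `RewardWeights`, `rule_reward`, and `step_reward` are hypothetical and chosen for illustration; they are not the functions in `reward_models_utils.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Mixing weights from the reported recipe: 0.1 / 0.2 / 0.7."""
    format: float = 0.1
    general_model: float = 0.2
    accuracy: float = 0.7


def rule_reward(fmt: float, acc: float, w: RewardWeights = RewardWeights()) -> float:
    """Rule-based part: weighted format plus weighted accuracy."""
    return w.format * fmt + w.accuracy * acc


def step_reward(fmt: float, acc: float, orm_score: float,
                w: RewardWeights = RewardWeights()) -> float:
    """Total reward: rule part plus the weighted general ORM score."""
    return rule_reward(fmt, acc, w) + w.general_model * orm_score


# Well-formatted and correct, ORM fully positive.
print(step_reward(fmt=1.0, acc=1.0, orm_score=1.0))
# Well-formatted but wrong, ORM gives no credit: only the format floor remains.
print(step_reward(fmt=1.0, acc=0.0, orm_score=0.0))
```

Because the accuracy weight is 0.7, a change in accuracy moves the total far more than the same change in the other two terms, which is consistent with the attribution given above.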

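3. Notes on the optimization signal

  • the final train/kl is 0.4442
  • the raw history contains a few KL spikes; the raw max is 216.2680
  • to keep extreme points from flattening the main trend, the KL plot below uses p99 clipping
  • judging by the main trend, training was not out of control throughout; it looks more like a few spikes in the middle, after which KL returned to an acceptable range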
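To illustrate what p99 clipping does to a series with a rare spike, here is a small pure-Python sketch. It assumes a linear-interpolation percentile and uses synthetic data; how the actual plot computed its percentile is not stated in this comment, and both helper names are hypothetical.

```python
def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between neighbours, q in [0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def clip_at_percentile(values: list[float], q: float = 99.0) -> list[float]:
    """Cap every value at the q-th percentile so rare spikes do not set the y-axis."""
    cap = percentile(values, q)
    return [min(v, cap) for v in values]


# Synthetic history: 199 ordinary points plus one spike the size of the reported raw max.
kl_history = [0.4] * 199 + [216.268]
clipped = clip_at_percentile(kl_history)
print(max(kl_history), "->", max(clipped))
```

Clipping changes only the plotted values; the raw history, including the 216.2680 maximum, stays available for the spike investigation mentioned later.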

4. Key curves

[Figure: Summary Card]

[Figure: Reward Dashboard]

[Figure: Optimization Dashboard]

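5. Current conclusions

Based on this complete run, my assessment of the current PR code is:

  • the Geo3K + general ORM RL demo pipeline now runs end-to-end in a real 2-GPU rlaunch environment
  • the current default configuration, with vllm as the rollout / RM engine backend, has been validated
  • the results show that training does more than just start up; the model actually learns: accuracy reward, general model reward, and total eval reward all improve clearly
  • format reward was near saturation early on, so the later gains come mainly from accuracy and partly from the general RM rather than from format
  • under the current image/runtime constraints, this demo is at least in a deliverable state: it runs, converges, and produces interpretable wandb metrics

If needed later, I can add:

  • a comparison table against the rule-only Geo3K baseline
  • comparison runs with different reward-mixing weights
  • further findings from investigating the KL spikes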

HansBug commented Apr 18, 2026

Member Author

Here is a complete experiment report for the real full-length SGLang + RM engine training run on this PR's current code. The format follows PR54: #54

1. The actual run behind this experiment

  • W&B run: https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/zrekazyw
  • run name: ORM-RL-Demo-Geo3K-General-SGLang-20260417_150451
  • state: finished
  • worker launch script: /mnt/shared-storage-user/zhangshaoang/.orm_rl_demo_full_sglang_20260417.sh
  • raw train log: /mnt/shared-storage-user/zhangshaoang/.orm_rl_demo_full_sglang_20260417_150345.log
  • save dir: /mnt/shared-storage-user/zhangshaoang/LightRFT/results/orm-rl-demo-general-geo3k-sglang/LightRFT-geo3k-general-orm-sglang-len_1024_2048-tbs_128-rbs_128-sample_8-kl_0.001-warmup_0.03-ep_20-lr_1e-6-20260417_150451
  • trajectory dir: /mnt/shared-storage-user/zhangshaoang/LightRFT/results/orm-rl-demo-general-geo3k-sglang/LightRFT-geo3k-general-orm-sglang-len_1024_2048-tbs_128-rbs_128-sample_8-kl_0.001-warmup_0.03-ep_20-lr_1e-6-20260417_150451/trajectories

2. Key launch configuration

This was not a local smoke test but an actual 2-GPU long training run launched with rlaunch. The core configuration:

  • Resources: 2 GPU / 40 CPU / 500000 memory
  • Image: registry.h.pjlab.org.cn/ailab-rlinfra-rlinfra_gpu/easyr1:lightrft-20260119
  • Conda environment: /root/miniconda3/envs/lightrft
  • actor: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
  • general RM: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
  • Data: /mnt/shared-storage-user/puyuan/data/geo3k
  • rollout engine: sglang
  • RM: rm_use_engine=True, with sglang as the backend
  • reward mixing:format 0.1 + general_model 0.2 + accuracy 0.7
  • train_batch_size=128, rollout_batch_size=128
  • micro_train_batch_size=4, micro_rollout_batch_size=4
  • n_samples_per_prompt=8, num_episodes=20
  • prompt_max_len=1024, generate_max_len=2048
  • actor_learning_rate=1e-6, init_kl_coef=0.001, lr_warmup_ratio=0.03
  • max_ckpt_num=1, save_trajectories=True, num_trajectories_to_save=16

In addition, before launching training the worker explicitly set up the runtime environment that sglang needs, mainly:

  • conda activate /root/miniconda3/envs/lightrft
  • PYTHONPATH=/mnt/shared-storage-user/zhangshaoang/LightRFT:$PYTHONPATH
  • LD_LIBRARY_PATH additionally includes:
    • /usr/local/nvidia/lib
    • /usr/local/nvidia/lib64
    • /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cuda_runtime/lib
    • /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cudnn/lib
    • /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cublas/lib
    • /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cuda_nvrtc/lib
    • /root/miniconda3/envs/lightrft/lib
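For anyone reproducing this setup from a Python launcher instead of a shell script, the path handling can be sketched as below. This is an assumption-laden illustration: `prepend_paths` is a hypothetical helper, and whether the original worker script prepended or appended these directories is not stated here.

```python
import os


def prepend_paths(current: str, extra: list[str]) -> str:
    """Put extra directories in front of a colon-separated search path.

    Empty entries and duplicates are dropped, keeping the first occurrence.
    """
    merged: list[str] = []
    for path in extra + [p for p in current.split(":") if p]:
        if path not in merged:
            merged.append(path)
    return ":".join(merged)


env_lib = "/root/miniconda3/envs/lightrft/lib"
site_nvidia = env_lib + "/python3.12/site-packages/nvidia"
extra_dirs = [
    "/usr/local/nvidia/lib",
    "/usr/local/nvidia/lib64",
    site_nvidia + "/cuda_runtime/lib",
    site_nvidia + "/cudnn/lib",
    site_nvidia + "/cublas/lib",
    site_nvidia + "/cuda_nvrtc/lib",
    env_lib,
]

# Build the environment for a child process; the current process is left untouched.
child_env = dict(os.environ)
child_env["LD_LIBRARY_PATH"] = prepend_paths(child_env.get("LD_LIBRARY_PATH", ""), extra_dirs)
print(child_env["LD_LIBRARY_PATH"])
```

The resulting `child_env` would be passed as `env=` to `subprocess.run`, since the dynamic loader reads `LD_LIBRARY_PATH` when a process starts.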

3. Core results and conclusions

  • training ran to completion, with final train/global_step = 320
  • eval was triggered 16 times in total
  • eval/reward_mean rose from 0.4636 to 0.5679
  • best eval/reward_mean = 0.5686,出现在 train_step = 260
  • final eval/accuracy_reward_mean = 0.5166
  • final eval/format_reward_mean = 0.9956
  • final eval/general_model_reward_mean = 0.1067
  • final train/general_model_reward_mean = 0.1309
  • final train/step_reward_mean = 0.6883
  • final train/kl = 0.5952

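My assessment of this run:

  • With the current PR code in a real 2-GPU rlaunch environment, the Geo3K + ORM RL demo + SGLang rollout + SGLang RM engine pipeline runs end-to-end.
  • It does more than start up: the reward curves look normal overall, and both accuracy reward and general_model_reward improve.
  • format_reward approached saturation very early; the later gain in total reward comes mainly from accuracy, with general_model_reward contributing extra credit.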

4. Figures

[Figure: Summary Card]

[Figure: Reward Dashboard]

[Figure: Optimization Dashboard]

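5. Three sample groups drawn from real trajectories

The three groups below are not hand-written illustrations; they were taken directly from the trajectory files of this run and cover:

  • a correct sample from the final stage
  • a sample that received partial positive credit from the general RM while accuracy was still 0
  • a failure sample where only format passed and no other reward was earned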

Case A: Final-step correct sample

  • Source: trajectories_step_320.json, idx=0, image images/step320_exp0_sample0_img0.png
  • Prompt: Find the area of the parallelogram. Round to the nearest tenth if necessary.
  • Output excerpt: ... The area of the parallelogram is approximately \boxed{39.0}.
  • Reward breakdown: total=1.0, format=1.0, accuracy=1.0, general_model=0.2, rule=0.8

Case B: Partial reward from RM support

  • Source: trajectories_step_80.json, idx=0, image images/step80_exp0_sample0_img0.png
  • Prompt: Find the area of the parallelogram. Round to the nearest tenth if necessary.
  • Output excerpt: ... The area of the parallelogram is approximately 38.97 square feet. \boxed{38.97}
  • Reward breakdown: total=0.3, format=1.0, accuracy=0.0, general_model=0.2, rule=0.1
  • This case is typical: the answer is very close to the correct value but does not match the accuracy rule, so the total reward comes mainly from format(0.1) + general_model(0.2)

Case C: Format-only failure case

  • Source: trajectories_step_160.json, idx=8, image images/step160_exp8_sample0_img0.png
  • Prompt: Find y. Assume that segments that appear to be tangent are tangent. Round to the nearest tenth if necessary.
  • Output excerpt: ... After calculating, we find that y = 10. </think> The radius y is \boxed{10}.
  • Reward breakdown: total=0.1, format=1.0, accuracy=0.0, general_model=0.0, rule=0.1
  • This case shows that the floor of the current reward mix is the format reward; if the answer is wrong and the general RM gives no credit, the total reward stays at 0.1
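The three breakdowns can be checked against the 0.1 / 0.2 / 0.7 recipe. In these records `format` and `accuracy` appear as raw scores while `general_model` and `rule` appear as weighted contributions, so the sketch below tests `rule = 0.1*format + 0.7*accuracy` and `total = rule + general_model`. The dictionary layout and the helper name are illustrative, not the trajectory file schema.

```python
cases = {
    "A": {"total": 1.0, "format": 1.0, "accuracy": 1.0, "general_model": 0.2, "rule": 0.8},
    "B": {"total": 0.3, "format": 1.0, "accuracy": 0.0, "general_model": 0.2, "rule": 0.1},
    "C": {"total": 0.1, "format": 1.0, "accuracy": 0.0, "general_model": 0.0, "rule": 0.1},
}


def is_consistent(case: dict, tol: float = 1e-9) -> bool:
    """Check one reward breakdown against the mixing recipe, within a float tolerance."""
    rule_ok = abs(0.1 * case["format"] + 0.7 * case["accuracy"] - case["rule"]) <= tol
    total_ok = abs(case["rule"] + case["general_model"] - case["total"]) <= tol
    return rule_ok and total_ok


for name, case in cases.items():
    print(name, is_consistent(case))
```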

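If needed later, I can add a rule-only baseline comparison based on this full run, or collect the finer-grained rollout/train/eval log statistics from this run into a separate appendix comment.

HansBug commented Apr 18, 2026

Member Author

A follow-up note correcting how the real question cases are presented.

The samples in the previous experiment-report comment were chosen by reward pattern, but a more direct presentation is to fix one question and compare how inference and reward change between step80 and step320.

I went through the real trajectories of this full run again. step80 and step320 actually share only 2 questions, so they make exactly 4 cards: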

  • Question A @ step80
  • Question A @ step320
  • Question B @ step80
  • Question B @ step320

All 4 come directly from the real trajectories:

  • Question A step80: trajectories_step_80.json, idx=0
  • Question A step320: trajectories_step_320.json, idx=0
  • Question B step80: trajectories_step_80.json, idx=8
  • Question B step320: trajectories_step_320.json, idx=8

Question A: parallelogram area

  • Shared prompt: Find the area of the parallelogram. Round to the nearest tenth if necessary.
  • Step 80 output: ... 38.97 square feet. \boxed{38.97}
  • Step 320 output: ... \boxed{39.0}
  • Step 80 rewards: total=0.3, format=1.0, accuracy=0.0, general_model=0.2, rule=0.1
  • Step 320 rewards: total=1.0, format=1.0, accuracy=1.0, general_model=0.2, rule=0.8

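The change here is typical: at step80 the output was already very close to the correct answer, so general_model_reward was positive, but accuracy_reward was still 0 because the output did not match the rule answer. By step320 the output was corrected from 38.97 to the rule answer 39.0, and the total reward jumped from 0.3 straight to 1.0.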

Question B: tangent geometry y

  • Shared prompt: Find y. Assume that segments that appear to be tangent are tangent. Round to the nearest tenth if necessary.
  • Step 80 output: ... The radius y is \boxed{10}.
  • Step 320 output: ... \boxed{12.6}.
  • Step 80 rewards: total=0.1, format=1.0, accuracy=0.0, general_model=0.0, rule=0.1
  • Step 320 rewards: total=1.0, format=1.0, accuracy=1.0, general_model=0.2, rule=0.8

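The change on this question is sharper: at step80 the response essentially only kept the format, and neither accuracy nor the general RM gave credit; at step320 the response became a complete, correct solution, so both accuracy_reward and general_model_reward turned positive.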
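Finding the questions shared by two trajectory dumps amounts to intersecting their prompts. The sketch below works on in-memory records and assumes each record carries a `prompt` field; the real `trajectories_step_*.json` schema is not shown in this thread, so the key name, the third prompt, and the helper are all hypothetical.

```python
def shared_prompts(step_a: list[dict], step_b: list[dict], key: str = "prompt") -> list[str]:
    """Return prompts that occur in both record lists, in step_a order, without repeats."""
    prompts_b = {record[key] for record in step_b}
    seen: set[str] = set()
    shared: list[str] = []
    for record in step_a:
        prompt = record[key]
        if prompt in prompts_b and prompt not in seen:
            seen.add(prompt)
            shared.append(prompt)
    return shared


AREA = "Find the area of the parallelogram. Round to the nearest tenth if necessary."
TANGENT = ("Find y. Assume that segments that appear to be tangent are tangent. "
           "Round to the nearest tenth if necessary.")

# Stand-in records; a real comparison would load both JSON files first.
step80 = [{"prompt": AREA}, {"prompt": TANGENT}, {"prompt": "hypothetical prompt seen only at step 80"}]
step320 = [{"prompt": AREA}, {"prompt": TANGENT}]

for prompt in shared_prompts(step80, step320):
    print(prompt)
```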

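I also updated the READMEs accordingly:

  • PR comments keep using GitHub attachment links so they can be read directly on the page
  • examples/orm_rl_demo/README.md
  • examples/orm_rl_demo/README_zh.md

All images in these two READMEs now use repo-relative paths, with the assets stored in:

  • examples/orm_rl_demo/assets/verified_full_run_20260417/

This way the documentation itself does not depend on external attachment links.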

@HansBug HansBug requested a review from puyuan1996 April 18, 2026 03:48
Comment thread examples/orm_rl_demo/README.md
Comment thread examples/orm_rl_demo/README_zh.md Outdated
Comment thread examples/orm_rl_demo/README_zh.md Outdated
Comment thread examples/orm_rl_demo/README_zh.md Outdated
Comment thread examples/orm_rl_demo/README_zh.md Outdated
Comment thread examples/orm_rl_demo/README_zh.md
Comment thread examples/orm_rl_demo/README_zh.md Outdated
Comment thread examples/orm_rl_demo/reward_models.py Outdated
Geo3K-specific reward mixing logic that combines format, general-model, and
accuracy rewards during training and evaluation.
"""
from __future__ import annotations

Collaborator

For anything that already exists in https://github.com/opendilab/LightRFT/blob/main/examples/gsm8k_geo3k/reward_models_utils.py, import it directly from gsm8k_geo3k/reward_models_utils.py; only what is newly added for orm_rl_demo should go in this file.

Member Author

This involves code refactoring, so it is skipped in this PR and will be handled separately later.

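Member Author

Some detail on the current state and my assessment:

How the overlapping functions actually compare

The pure rule functions extract_response, geo3k_accuracy_reward_fn, geo3k_format_reward_fn, gsm8k_accuracy_reward_fn, and gsm8k_format_reward_fn are indeed implemented almost identically, with only minor differences in docstring wording.

The three core functions mix_rewards, reward_fn, and load_reward_models differ substantially between the two: the gsm8k_geo3k version is a pure rule path (model_reward_list is always empty), while the orm_rl_demo version has to support score fusion from a real neural RM, so the two are not interchangeable.

What it would actually take

The scripts under examples/ are not a Python package (there is no __init__.py), so a cross-directory import in practice has to rely on sys.path.insert, which introduces an implicit dependency between the two examples. That route is not clean to begin with.

The truly clean approach is to move the shared pure utility functions (accuracy/format reward fns, extract_response, etc.) into a submodule of the lightrft/ package (for example lightrft/utils/reward_utils.py) and have both sides import from lightrft. But that means changing the lightrft package itself, adding tests, and updating imports in both examples, which is a larger scope than touching only the example.

Current cost and risk assessment

The two examples are intentionally designed to be relatively independent. The orm_rl_demo copy of reward_models_utils.py contains demo-specific logic such as RM loading, engine configuration, and reward fusion, and is not functionally symmetric with the rule-only gsm8k_geo3k version; forcing a merge could make both harder to maintain.

I suggest keeping things as they are for now. If the reward utility architecture is to be unified later, a separate issue to design it (which functions belong in the lightrft/ package, how the interfaces are defined) is the better way, so that this PR does not pick up a riskier change on the side.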

from reward_models_utils import load_reward_models, reward_fn, RECIPE


def _apply_label_override(dataset, label_key: str, label_override: str, strategy, dataset_name: str):

Collaborator

Member Author

This involves code refactoring (reusing the train function from gsm8k_geo3k), so it is skipped in this PR and will be handled separately later.

Member Author

Some detail on the current state and my assessment:

Where the two train() functions actually differ

The train() in orm_rl_demo and the train() in gsm8k_geo3k share the same overall structure. There are only two core differences:

  1. After the prompt dataset and the eval dataset are loaded, each gets one _apply_label_override call. This is orm_rl_demo-specific logic (at runtime it overrides the geo3k label to geo3k_general so that samples take the general ORM reward-fusion path); gsm8k_geo3k has no such need.
  2. Some differences in the details of the critic FSDP branch and logging (the gsm8k version also has extras such as torch.multiprocessing.set_sharing_strategy).

What reuse would actually take

The most direct way is to add an optional dataset_transform: Optional[Callable] = None parameter to the gsm8k_geo3k train(), with orm_rl_demo passing in a callable that wraps _apply_label_override. But that pushes an ORM-specific hook requirement into the simpler example, which is the wrong direction.

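The other route is to extract the common train() into an extensible base class in the lightrft/ package, with each example subclassing it and overriding the data-preprocessing step. That is a reasonable long-term direction, but it is a framework-level change whose cost and review risk are both larger than this PR itself.

Current recommendation

The orm_rl_demo train() has already completed a full 2-GPU training run, and keeping an independent copy is the lowest-risk option for now. If the example train() architecture is to be unified later (with support for a dataset transform hook), a separate issue to design it is the better way, to avoid pulling a larger change into this PR.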

HansBug and others added 2 commits April 24, 2026 19:33
…x format

- Rename assets/verified_full_run_20260417 -> assets/exp_20260417
- Rewrite README_zh.md: user-facing tone, merge overview sections, restructure
  experiment results into 实验设置/整体曲线结果/案例分析, clarify reward formula
  (general_model coefficient 0.2 vs raw ORM output range)
- Rewrite README.md with equivalent English changes
- Convert all reward_models.py docstrings from Google Args:/Returns: style to
  Sphinx :param/:type/:return/:rtype format to match repo convention

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…ics to [0, 1]

Previously, the logged metrics were weighted contributions rather than
raw scores:
- general_model_reward was logged as 0.2 × ORM_score, capping at 0.2
- rule_reward was logged as 0.1×format + 0.7×accuracy, ranging [0, 0.8]

The final_reward computation (used for training) is unchanged:
  final = 0.1×fmt + 0.2×orm_score + 0.7×acc

Only the metrics dict values are corrected:
- general_model_reward now logs the raw ORM output {0.0, 0.5, 1.0}
- rule_reward now logs (0.1×fmt + 0.7×acc) / 0.8, normalizing to [0, 1]
  while preserving the relative weighting between format and accuracy

Verified with a 2-GPU run: metrics now fall in [0, 1] and final_reward
values are unchanged from previous runs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
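HansBug commented Apr 27, 2026

Member Author

Fix: normalize the general_model_reward and rule_reward metrics to [0, 1]

Previously these two metrics recorded the weighted contribution rather than the raw score, which made the values small and unintuitive:

| metric | before fix | range before | after fix | range after |
|---|---|---|---|---|
| general_model_reward | 0.2 × ORM score | [0, 0.2] | raw ORM output | {0, 0.5, 1.0} |
| rule_reward | 0.1×fmt + 0.7×acc | [0, 0.8] | (0.1×fmt + 0.7×acc) / 0.8 | [0, 1] |

Training computation is entirely unaffected. The final_reward formula:

final = 0.1 × format + 0.2 × orm_score + 0.7 × accuracy

is unchanged; only the values recorded in metrics_dict are corrected (this dict is used only for logging and trajectory saving and does not take part in gradient computation).

Validation results (real 2-GPU run):

  • general_model_reward_mean: 0.274 ~ 0.399, clearly above the old cap of 0.2 ✅
  • rule_reward_mean: 0.306 ~ 0.445, within [0, 1] ✅
  • rollout_reward: 0.3 ~ 0.44, on the same level as before the fix, so training is unaffected ✅
  • cross-check: at step 6, (0.1×0.959 + 0.7×0.371)/0.8 = 0.445, matching the log exactly ✅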
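The split between the training reward and the logged metrics can be sketched as follows. This is a minimal illustration of the formulas in this comment, not the code in `reward_models_utils.py`; the function names are hypothetical, and `orm_score` stands for the raw ORM output.

```python
W_FORMAT, W_ORM, W_ACC = 0.1, 0.2, 0.7
RULE_NORM = 0.8  # W_FORMAT + W_ACC, written as in the comment above


def final_reward(fmt: float, orm_score: float, acc: float) -> float:
    """Reward used for training; unchanged by the metrics fix."""
    return W_FORMAT * fmt + W_ORM * orm_score + W_ACC * acc


def logged_metrics(fmt: float, orm_score: float, acc: float) -> dict[str, float]:
    """Values written to the metrics dict after the fix, each in [0, 1]."""
    return {
        "general_model_reward": orm_score,
        "rule_reward": (W_FORMAT * fmt + W_ACC * acc) / RULE_NORM,
    }


# Reproduce the step-6 cross-check from the mean format and accuracy scores.
step6 = logged_metrics(fmt=0.959, orm_score=0.0, acc=0.371)
print(f"rule_reward = {step6['rule_reward']:.4f}")

# The training reward is still recoverable from the normalized metrics.
fmt, orm, acc = 1.0, 0.5, 1.0
metrics = logged_metrics(fmt, orm, acc)
recovered = RULE_NORM * metrics["rule_reward"] + W_ORM * metrics["general_model_reward"]
print(abs(recovered - final_reward(fmt, orm, acc)) < 1e-9)
```

The cross-check works on batch means because the normalization is linear in the format and accuracy scores.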

commit: cb2c20e

@AltmanD AltmanD left a comment

The relevant changes in the Chinese README need to be synced to the English README.

Comment thread examples/orm_rl_demo/README_zh.md Outdated
Comment thread examples/orm_rl_demo/README_zh.md Outdated
num_per_rank = len(texts) // model._tp_size
texts = texts[model._tp_rank * num_per_rank : (model._tp_rank + 1) * num_per_rank]
else:
from vllm import SamplingParams

Let's put all imports at the top of the file.

Member Author

This is an intentional lazy import. vllm is not necessarily installed in every runtime environment (an sglang-only environment has no vllm), and a top-level import would raise ImportError there and make the whole module unimportable. The current approach runs from vllm import SamplingParams only when the vllm branch is actually taken, which avoids a hard dependency. If needed later, we could put try: from vllm import SamplingParams\nexcept ImportError: SamplingParams = None at the top of the file, but that adds extra None checks; on balance the lazy import is cleaner for now.
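The trade-off discussed here can be made concrete with a small helper that defers an optional import until the backend is actually selected. This is a generic sketch: `load_optional` and `build_sampling_params` are hypothetical names, and only `vllm.SamplingParams` comes from the code under review.

```python
import importlib
from typing import Any, Optional


def load_optional(module_name: str, attr: str) -> Optional[Any]:
    """Import module_name on demand and return one attribute, or None if it is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr)


def build_sampling_params(**kwargs: Any) -> Any:
    """Create vllm sampling params only when the vllm branch is taken."""
    sampling_params_cls = load_optional("vllm", "SamplingParams")
    if sampling_params_cls is None:
        raise RuntimeError("The vllm backend was selected, but vllm is not installed.")
    return sampling_params_cls(**kwargs)


# Importing this module never touches vllm; a missing package only matters at call time.
print(load_optional("json", "dumps")({"backend": "sglang"}))
print(load_optional("package_that_is_not_installed_xyz", "Anything"))
```

An sglang-only environment can import a module written this way without error, and the failure surfaces only on the code path that genuinely needs vllm.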

Comment thread examples/orm_rl_demo/train_colocate.py Outdated
- README: fix "reward engine" -> "SGLang" in engine description (FSDP and
  SGLang are now consistent concepts on the same level)
- README: drop redundant "template / no hardcoded paths" sentence; the
  /path/to/ placeholders in the code block already make this obvious
- train_colocate.py: remove verbose banner comment block above
  init_model_context; the short "# configure model" line is sufficient

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@puyuan1996 puyuan1996 merged commit 16db1b0 into opendilab:main Apr 29, 2026
1 check passed
HansBug added a commit to HansBug/LightRFT that referenced this pull request Jun 3, 2026
Brings in:
- examples/orm_rl_demo: a new general ORM RL example on Geo3K (PR opendilab#56)
- minor stat-aggregation changes in spmd_ppo_trainer / ppo_trainer_vl
- strategy_base reward-model shard_size = world_size

Conflict resolution (3 files):
- lightrft/strategy/strategy_base.py: kept HEAD's _resolve_fsdp_shard_size
  helper (more robust for non-divisible world sizes).
- lightrft/trainer/ppo_trainer_vl.py: kept HEAD's generic
  reward_metric_values = defaultdict(list) collection (subsumes the
  named-list approach upstream introduced, and is required for the PRM
  variant 2 diagnostics).
- lightrft/trainer/spmd_ppo_trainer.py: kept HEAD's compact rollout-stats
  aggregation; upstream's elaborate named-block was redundant with the
  later "Detailed Step Statistics" section HEAD already has.
- lightrft/trainer/fast_exp_maker.py: removed duplicate references=
  kwarg auto-merge fluke.

Verification:
- python3 ast.parse on all 4 conflict files → clean
- python3 -m unittest examples.math_prm.tests.test_ursa_variant2 -v
  → Ran 9 tests in 0.050s, OK
HansBug added a commit to HansBug/LightRFT that referenced this pull request Jun 3, 2026
…sults

Resolves all 9 C-severity and 5 I-severity findings from
opendilab#53 (comment)

C — Critical (blocking) — fixed:
  - Delete 7 debug-only smoke / fix-verify scripts that hardcoded
    /home/ubuntu, /mnt/.../puyuan, and /home/ubuntu/miniconda3/.../torchrun:
      run_grpo_smoke_misalign_fix.sh
      run_smoke_base_eval_only.sh
      run_smoke_eval_fix_verify.sh
      run_smoke_padding_fix_verify.sh
      run_smoke_per_step_prm.sh
      run_smoke_per_step_prm_groupnorm.sh
      run_smoke_paper_variant2.sh
    Only the two production launchers ship now (PS-GRPO + variant 2).
  - tools/prepare_ursa_stage3_manifest.py: drop /home/ubuntu/... defaults
    for --input-path / --image-root; both are now required=True so a
    fresh user gets a clear missing-arg error instead of silently
    targeting someone else's home directory.
  - run_grpo_math_prm_ursa_8b_variant2.sh + run_grpo_math_prm_ursa_8b.sh:
    add `set -eo pipefail` at the top so a crashed torchrun propagates
    its exit code through the `2>&1 | tee` pipeline (previously, tee's
    success masked torchrun crashes and orchestrators saw a green run).

I — Important (blocking) — fixed:
  - examples/math_prm/assets/exp_20260603/{eval_outcome,kl_and_rollout,
    eval_quality,variant2_health}.png: 4 W&B-derived figures from the
    9-day production run, matching the orm_rl_demo/assets/exp_*/ pattern
    established in PR opendilab#56.
  - README.md / README_zh.md: add §7 "Results — 9-day production run"
    section with eval-outcome table, W&B run link, and the 4 figures.
  - README.md / README_zh.md: add §6 "Strict Paper Eq.9 — variant 2 path"
    section (formula, math_per_step_prm workflow, sed-relabel command,
    unit-test invocation) — previously the variant-2 launcher shipped
    without any README coverage.
  - README.md / README_zh.md: update §8 files tree to match git ls-files
    (adds ursa_variant2.py, test_ursa_variant2.py, assets/; removes the
    7 deleted smoke scripts).
  - README.md / README_zh.md: also add Available labels row for
    math_per_step_prm; expand the "What's Logged" section to include
    the 13 PRM diagnostic fields + the 7 variant-2 ursa_v2_* fields.
  - run_grpo_math_prm_ursa_8b.sh:236: switch from
    `> "${TRAIN_LOG}" 2>&1` to `2>&1 | tee "${TRAIN_LOG}"` so tmux
    operators see live training output (matches orm_rl_demo /
    r1_aqa / gsm8k_geo3k launcher convention).
  - test_ursa_variant2.py: move from examples/math_prm/tests/ to
    examples/math_prm/ top level to match every other example in the
    repo. Path-resolution fixed accordingly.

M — Minor (non-blocking) — also addressed:
  - run_grpo_math_prm_ursa_8b_variant2.sh header docstring rewritten to
    describe variant 2 (was previously a verbatim PS-GRPO copy/paste).
  - train_colocate.py:28 docstring "usage: python train_grpo_rm_colocate.py"
    corrected to "python examples/math_prm/train_colocate.py".

Verification:
  $ python3 -m unittest examples.math_prm.test_ursa_variant2 -v
  Ran 9 tests in 0.034s — OK
Labels

  • documentation: Improvements or additions to documentation
  • enhancement: New feature or request

3 participants