feature(zsa): add a minimal general ORM RL example on Geo3K #56
Merged
21 commits
fd2588b dev(hansbug): add math_prm from cluster (HansBug)
3a3066c chore(safework): sync runnable example from cluster (HansBug)
dc44de7 merge: sync upstream main into dev/st (HansBug)
076be66 refactor(orm_rl_demo): rename safework example to orm_rl_demo (HansBug)
e14a64b refactor(orm_rl_demo): narrow demo to one Geo3K general ORM entry (HansBug)
2766906 fix orm rl demo 2gpu bringup (HansBug)
218b89f fix orm rl demo rlaunch bringup (HansBug)
b5c119e fix orm rl demo trajectory analysis arg (HansBug)
0e1efe9 fix orm rl demo reward engine bringup (HansBug)
5bc2f3c address orm rl demo pr review feedback (HansBug)
aa149c6 clarify general reward metric names (HansBug)
f9bb867 Fix ORM general RM engine prompts (HansBug)
fd09884 merge: sync upstream main into dev/st (HansBug)
c290079 fix: address orm rl demo review feedback (HansBug)
a8d338c style: fix trainer yapf formatting (HansBug)
8919ece docs(orm_rl_demo): add full-run validation record (HansBug)
3a9284b docs(orm_rl_demo): store experiment figures in repo (HansBug)
b5a6a16 fix(orm_rl_demo): default demo script to sglang (HansBug)
39fab41 docs(orm_rl_demo): restructure README and convert docstrings to Sphin… (HansBug)
cb2c20e fix(orm_rl_demo): normalize general_model_reward and rule_reward metr… (HansBug)
8696f3c docs(orm_rl_demo): address AltmanD review comments (HansBug)
@@ -0,0 +1,105 @@

<div align="center">

# ORM RL Demo

Complete ORM trajectory-scoring RL training demo based on Geo3K.

</div>

## Overview

This demo walks through the full pipeline for RL training in which an outcome reward model (ORM) scores trajectories:
- dataset: Geo3K
- actor: Qwen2.5-VL 7B model
- reward: one general ORM combined with a rule-based accuracy reward and a format reward, all of which feed the GRPO loss
- training engine: FSDP; inference engine: SGLang

The actor generates Geo3K trajectories and the general ORM scores them. Those scores are mixed with a rule-based accuracy reward (`accuracy_reward`) and a format reward (`format_reward`) to produce the reward used by the GRPO loss. To avoid rewriting the Geo3K dataset files, the demo overrides the dataset label to `geo3k_general` at runtime, so the original dataset path is reused while samples are routed through the general ORM reward mix.

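GRPO compares each trajectory's mixed reward with the rewards of the other trajectories sampled for the same prompt. The sketch below shows the commonly used group normalization under stated assumptions: the function name `group_relative_advantages` is hypothetical, the population standard deviation is used for simplicity, and none of this is taken from the demo's trainer, which may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled trajectories.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical mixed rewards for four trajectories of the same prompt.
    group = [1.0, 0.3, 0.1, 0.3]
    for reward, adv in zip(group, group_relative_advantages(group)):
        print(f"reward={reward:.1f} advantage={adv:+.3f}")
```

Trajectories scoring above the group mean receive a positive advantage and those below it a negative one; if every trajectory in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.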
Environment requirements match the repository-level [README.md](../../README.md); refer to that document for setup.

## Project Structure

```text
orm_rl_demo/
├── train_colocate.py
├── reward_models.py
├── reward_models_utils.py
├── test_reward_models.py
└── run_general_fsdp_qwenvl.sh
```

## Quick Start

Set the data and model paths through the environment variables below, then run the entry script:

```bash
# Geo3K dataset location
export DATA_PATH=/path/to/geo3k
# Actor checkpoint
export PRETRAIN_PATH=/path/to/Qwen2.5-VL-7B-Instruct
# JSON object mapping each reward-model label to its checkpoint path
export REWARD_PRETRAIN_PATHS='{"general":"/path/to/general-reward-model"}'
bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh
```

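`REWARD_PRETRAIN_PATHS` holds a JSON object, so a malformed value is an easy mistake to make. How the entry script consumes the variable is not shown here; the following is a minimal sketch, assuming it is read as a mapping from reward-model label to checkpoint path. The helper name `load_reward_paths` is hypothetical.

```python
import json
import os


def load_reward_paths(env_name: str = "REWARD_PRETRAIN_PATHS") -> dict:
    """Parse a JSON object of reward-model label -> checkpoint path."""
    raw = os.environ.get(env_name)
    if not raw:
        raise ValueError(f"{env_name} is not set")
    paths = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(paths, dict) or not all(
        isinstance(value, str) for value in paths.values()
    ):
        raise ValueError(f"{env_name} must be a JSON object of label -> path")
    return paths


if __name__ == "__main__":
    # Placeholder value copied from the Quick Start snippet.
    os.environ.setdefault(
        "REWARD_PRETRAIN_PATHS", '{"general":"/path/to/general-reward-model"}'
    )
    print(load_reward_paths())
```

Running a check like this before launching can catch quoting errors in the shell export early, before any model is loaded.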
## Results

### Experiment Setup

This demo has been validated with one full training run on 2 GPUs (W&B: [ORM-RL-Demo-QwenVL-7B-Geo3K](https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/zrekazyw)):

| Item | Value |
| --- | --- |
| Actor | Qwen2.5-VL-7B-Instruct |
| General RM | Qwen2.5-VL-7B general reward model |
| Dataset | Geo3K |
| Training engine | FSDP |
| Inference engine | SGLang (`rm_use_engine=True`) |
| Reward mixing | `format_reward × 0.1 + general_model_reward × 0.2 + accuracy_reward × 0.7` |
| Batch sizes | `train_batch_size=128`, `rollout_batch_size=128` |
| Sampling | `n_samples_per_prompt=8`, `num_episodes=20` |
| Sequence length | `prompt_max_len=1024`, `generate_max_len=2048` |
| Optimizer / KL | `actor_learning_rate=1e-6`, `init_kl_coef=0.001`, `lr_warmup_ratio=0.03` |

Three reward components are combined with the weights above to form the reward used by the GRPO loss: the rule-based accuracy reward (`accuracy_reward`, weight 0.7), the ORM score (`general_model_reward`, weight 0.2), and the format reward (`format_reward`, weight 0.1). Reported `general_model_reward` values (e.g. `0.2`) are the ORM output (one of 0.0, 0.5, or 1.0) multiplied by the 0.2 coefficient, not the raw model score.

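The weighted sum can be sketched as follows. This is an illustration of the formula in the setup table, not the demo's implementation: the function name `mix_rewards` and the returned field names are hypothetical, and the `rule` field is an assumption that groups the two rule-based terms (format and accuracy).

```python
# Weights copied from the "Reward mixing" row of the setup table.
WEIGHTS = {"format": 0.1, "general_model": 0.2, "accuracy": 0.7}


def mix_rewards(format_reward: float, orm_score: float, accuracy_reward: float) -> dict:
    """Return each component's weighted contribution plus the total.

    orm_score is the raw ORM output (0.0, 0.5, or 1.0).
    """
    contributions = {
        "format": WEIGHTS["format"] * format_reward,
        "general_model": WEIGHTS["general_model"] * orm_score,
        "accuracy": WEIGHTS["accuracy"] * accuracy_reward,
    }
    total = sum(contributions.values())
    # Assumption: "rule" is the sum of the two rule-based contributions.
    rule = contributions["format"] + contributions["accuracy"]
    return {**contributions, "rule": rule, "total": total}


if __name__ == "__main__":
    # Correct format, raw ORM score 1.0, wrong final answer.
    mixed = mix_rewards(format_reward=1.0, orm_score=1.0, accuracy_reward=0.0)
    print({name: round(value, 4) for name, value in mixed.items()})
```

With a correct format, a raw ORM score of 1.0, and a wrong final answer, the total comes to about 0.3, of which 0.2 is the ORM contribution. Under the grouping assumed here, `rule` comes to 0.1 in that case and 0.8 when the answer is also correct.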
### Curve Results

The run completed successfully (`train/global_step=320`, 16 eval passes):
- `eval/reward_mean` improved from `0.4636` to `0.5679`
- Best `eval/reward_mean=0.5686` at `train_step=260`
- Final `eval/accuracy_reward_mean=0.5166`, `eval/format_reward_mean=0.9956`, `eval/general_model_reward_mean=0.1067`

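As a consistency check, the final component means can be recombined with the mixing weights. The sketch assumes that `eval/format_reward_mean` and `eval/accuracy_reward_mean` are logged unweighted, while `eval/general_model_reward_mean` is already multiplied by 0.2; the variable and function names are illustrative only.

```python
# Final eval means as reported in the list above.
FORMAT_MEAN = 0.9956            # assumed unweighted
ACCURACY_MEAN = 0.5166          # assumed unweighted
GENERAL_MEAN_WEIGHTED = 0.1067  # assumed already scaled by the 0.2 coefficient


def recombine(format_mean: float, accuracy_mean: float, general_weighted: float) -> float:
    """Apply the 0.1 / 0.7 weights to the raw means and add the weighted ORM term."""
    return 0.1 * format_mean + 0.7 * accuracy_mean + general_weighted


recombined = recombine(FORMAT_MEAN, ACCURACY_MEAN, GENERAL_MEAN_WEIGHTED)
implied_raw_orm_mean = GENERAL_MEAN_WEIGHTED / 0.2
relative_gain = (0.5679 - 0.4636) / 0.4636

if __name__ == "__main__":
    print(f"recombined eval reward : {recombined:.4f}")
    print(f"implied raw ORM mean   : {implied_raw_orm_mean:.4f}")
    print(f"relative gain          : {relative_gain:.1%}")
```

Under these assumptions the recombined value is about 0.5679, in line with the reported final `eval/reward_mean`. The implied unweighted ORM mean is about 0.53, and the overall gain from 0.4636 is roughly 22% in relative terms.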
|  | ||
|
|
||
|  | ||
|
|
||
|  | ||
|
|
||
| ### Case Study | ||
|
|
||
Two questions appear in the saved trajectories at both step 80 and step 320. They are compared below at early and late training.

#### Question A: Parallelogram Area

![Question A trajectory at step 80](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_a_step80.png)

![Question A trajectory at step 320](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_a_step320.png)

- Step 80 rewards: `total=0.3`, `format=1.0`, `accuracy=0.0`, `general_model=0.2`, `rule=0.1`
- Step 320 rewards: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- At step 80 the actor already produced a close answer, so the ORM scored it 1.0 (contributing 0.2 after the 0.2 coefficient). By step 320 the output had moved from `38.97` to the rule-matching `39.0`, flipping `accuracy_reward` from `0.0` to `1.0`.

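The gap between the two signals in Question A comes from how strict a rule-based check is compared with a graded model score. The demo's actual `accuracy_reward` implementation is not shown here; the sketch below is a hypothetical strict numeric matcher, with `strict_accuracy_reward` and its tolerance chosen purely for illustration.

```python
import math


def strict_accuracy_reward(prediction: str, reference: str, tol: float = 1e-6) -> float:
    """Return 1.0 only when the prediction matches the reference answer.

    Hypothetical rule: compare as numbers within a tiny absolute tolerance,
    falling back to exact string comparison for non-numeric answers.
    """
    try:
        matched = math.isclose(
            float(prediction), float(reference), rel_tol=0.0, abs_tol=tol
        )
    except ValueError:
        matched = prediction.strip() == reference.strip()
    return 1.0 if matched else 0.0


if __name__ == "__main__":
    for answer in ("38.97", "39.0"):
        print(answer, strict_accuracy_reward(answer, "39.0"))
```

A matcher of this kind gives `38.97` no credit against `39.0`, while an ORM can still rate the same trajectory highly. That is the pattern in the step-80 rewards, where `accuracy=0.0` sits alongside a non-zero `general_model` contribution.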
#### Question B: Tangent Geometry `y`

![Question B trajectory at step 80](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_b_step80.png)

![Question B trajectory at step 320](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_b_step320.png)

- Step 80 rewards: `total=0.1`, `format=1.0`, `accuracy=0.0`, `general_model=0.0`, `rule=0.1`
- Step 320 rewards: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- At step 80 only the format reward was earned; neither the accuracy rule nor the ORM rewarded the answer. By step 320 both contributed positively.

## License

This project is licensed under the Apache 2.0 License. See [LICENSE](../../LICENSE) for details.

@@ -0,0 +1,105 @@

<div align="center">

# ORM RL Demo Training Example

A complete demo of RL training with ORM trajectory scoring, based on Geo3K.

</div>

## Overview

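This example shows the full pipeline for RL training with ORM trajectory scoring, using the following configuration:
- dataset: Geo3K
- actor: Qwen2.5-VL 7B model
- reward: a single general outcome reward model, combined with a rule-based correctness reward and a format reward to compute the GRPO loss
- training engine: FSDP; inference engine: SGLang

During training, the actor generates trajectories on Geo3K and the general ORM scores them. The ORM score is mixed with the rule-based correctness reward (`accuracy_reward`) and the format reward (`format_reward`), and the three together determine the GRPO loss. To avoid modifying the Geo3K dataset files directly, the demo overrides the data label to `geo3k_general` at runtime, reusing the original data path while routing through the general ORM reward-mixing logic.

Environment requirements match the repository root [README_zh.md](../../README_zh.md#环境要求); refer to the main document.
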
## Project Structure

```text
orm_rl_demo/
├── train_colocate.py
├── reward_models.py
├── reward_models_utils.py
├── test_reward_models.py
└── run_general_fsdp_qwenvl.sh
```

## Quick Start

Set the data and model paths through the environment variables below, then run the entry script:

```bash
export DATA_PATH=/path/to/geo3k
export PRETRAIN_PATH=/path/to/Qwen2.5-VL-7B-Instruct
export REWARD_PRETRAIN_PATHS='{"general":"/path/to/general-reward-model"}'
bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh
```

## Results

### Experiment Setup

This demo has been validated with one full training run on 2 GPUs (W&B run: [ORM-RL-Demo-QwenVL-7B-Geo3K](https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/zrekazyw)). The key configuration is:

| Item | Value |
| --- | --- |
| Actor | Qwen2.5-VL-7B-Instruct |
| General RM | Qwen2.5-VL-7B general reward model |
| Data | Geo3K |
| Training engine | FSDP |
| Inference engine | SGLang (`rm_use_engine=True`) |
| Reward mixing | `format_reward × 0.1 + general_model_reward × 0.2 + accuracy_reward × 0.7` |
| Batch size | `train_batch_size=128`, `rollout_batch_size=128` |
| Sampling | `n_samples_per_prompt=8`, `num_episodes=20` |
| Length | `prompt_max_len=1024`, `generate_max_len=2048` |
| Optimization / KL | `actor_learning_rate=1e-6`, `init_kl_coef=0.001`, `lr_warmup_ratio=0.03` |

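The three rewards (rule-based correctness reward `accuracy_reward`, ORM score `general_model_reward` with coefficient 0.2, and format reward `format_reward`) are mixed with the weights above to compute the final GRPO loss. The 0.2 associated with `general_model_reward` is the weight coefficient: the ORM itself outputs 0.0, 0.5, or 1.0, and multiplying by 0.2 gives its reward contribution.
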
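### Overall Curve Results

Training ran to completion (`train/global_step=320`, 16 eval passes):
- `eval/reward_mean` improved from `0.4636` to `0.5679`
- Best `eval/reward_mean=0.5686`, reached at `train_step=260`
- Final `eval/accuracy_reward_mean=0.5166`, `eval/format_reward_mean=0.9956`, `eval/general_model_reward_mean=0.1067`

![Eval total reward curve](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/eval_reward_mean.png)

![Eval per-component reward curves](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/eval_reward_components.png)

![Reward breakdown during training](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/train_reward_breakdown.png)
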
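### Case Study

The saved trajectories at step 80 and step 320 share two questions. These two questions are compared below from early to late training.

#### Question A: Parallelogram Area

![Question A trajectory card at step 80](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_a_step80.png)

![Question A trajectory card at step 320](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_a_step320.png)

- Step 80 reward: `total=0.3`, `format=1.0`, `accuracy=0.0`, `general_model=0.2`, `rule=0.1`
- Step 320 reward: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- Interpretation: at step 80 the actor's answer was already very close, so the ORM gave 1.0, contributing 0.2 after the 0.2 coefficient. By step 320 the output had been corrected from `38.97` to the rule answer `39.0`, and `accuracy_reward` jumped from `0.0` to `1.0`.
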
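#### Question B: Tangent Geometry `y`

![Question B trajectory card at step 80](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_b_step80.png)

![Question B trajectory card at step 320](https://raw.githubusercontent.com/opendilab/LightRFT/main/assets/orm_rl_demo/case_b_step320.png)

- Step 80 reward: `total=0.1`, `format=1.0`, `accuracy=0.0`, `general_model=0.0`, `rule=0.1`
- Step 320 reward: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- Interpretation: at step 80 only the format reward was earned, and neither accuracy nor the general RM gave any score. By step 320 both had become positive contributions.
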
## License

This project is licensed under the Apache 2.0 License. See [LICENSE](../../LICENSE) for details.

HansBug marked this conversation as resolved.