Merged
21 commits
fd2588b dev(hansbug): add math_prm from cluster (HansBug, Feb 24, 2026)
3a3066c chore(safework): sync runnable example from cluster (HansBug, Apr 9, 2026)
dc44de7 merge: sync upstream main into dev/st (HansBug, Apr 9, 2026)
076be66 refactor(orm_rl_demo): rename safework example to orm_rl_demo (HansBug, Apr 9, 2026)
e14a64b refactor(orm_rl_demo): narrow demo to one Geo3K general ORM entry (HansBug, Apr 9, 2026)
2766906 fix orm rl demo 2gpu bringup (HansBug, Apr 13, 2026)
218b89f fix orm rl demo rlaunch bringup (HansBug, Apr 13, 2026)
b5c119e fix orm rl demo trajectory analysis arg (HansBug, Apr 13, 2026)
0e1efe9 fix orm rl demo reward engine bringup (HansBug, Apr 15, 2026)
5bc2f3c address orm rl demo pr review feedback (HansBug, Apr 15, 2026)
aa149c6 clarify general reward metric names (HansBug, Apr 15, 2026)
f9bb867 Fix ORM general RM engine prompts (HansBug, Apr 15, 2026)
fd09884 merge: sync upstream main into dev/st (HansBug, Apr 16, 2026)
c290079 fix: address orm rl demo review feedback (HansBug, Apr 16, 2026)
a8d338c style: fix trainer yapf formatting (HansBug, Apr 16, 2026)
8919ece docs(orm_rl_demo): add full-run validation record (HansBug, Apr 18, 2026)
3a9284b docs(orm_rl_demo): store experiment figures in repo (HansBug, Apr 18, 2026)
b5a6a16 fix(orm_rl_demo): default demo script to sglang (HansBug, Apr 18, 2026)
39fab41 docs(orm_rl_demo): restructure README and convert docstrings to Sphin… (HansBug, Apr 24, 2026)
cb2c20e fix(orm_rl_demo): normalize general_model_reward and rule_reward metr… (HansBug, Apr 27, 2026)
8696f3c docs(orm_rl_demo): address AltmanD review comments (HansBug, Apr 29, 2026)
105 changes: 105 additions & 0 deletions examples/orm_rl_demo/README.md
@@ -0,0 +1,105 @@
<div align="center">

# ORM RL Demo

Complete ORM trajectory-scoring RL training demo based on Geo3K.

</div>

## Overview

This demo shows the full pipeline for RL training in which an outcome reward model (ORM) scores the actor's trajectories:
- dataset: Geo3K
- actor: Qwen2.5-VL-7B-Instruct
- reward: one general outcome reward model combined with a rule-based accuracy reward and a format reward, all contributing to the GRPO loss
- training engine: FSDP; inference engine: SGLang

The actor generates Geo3K trajectories, the general ORM scores them, and the scores are combined with a rule-based accuracy reward (`accuracy_reward`) and a format reward (`format_reward`) to compute the final GRPO loss. To avoid rewriting the Geo3K dataset files, the demo overrides the dataset label to `geo3k_general` at runtime so the original dataset path can be reused while routing through the general ORM reward mix.
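
GRPO turns the mixed reward into a learning signal by comparing samples of the same prompt with each other. The sketch below illustrates that group-relative normalization in plain Python. It is not this repository's implementation: the function name, the `eps` term, the use of the population standard deviation, and the reward values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of samples (GRPO-style).

    Hypothetical helper: each advantage is the reward's distance from the
    group mean, scaled by the group's standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up total rewards for the 8 samples drawn for a single prompt.
group = [1.0, 0.3, 0.1, 1.0, 0.3, 0.1, 0.1, 1.0]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])
```

Samples that score above their group's mean receive a positive advantage and the others a negative one, so the reward mix only matters through how it ranks samples within a group.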

The environment requirements are the same as those in the repository-level [README.md](../../README.md); refer to that document for setup.

## Project Structure

```text
orm_rl_demo/
├── train_colocate.py
├── reward_models.py
├── reward_models_utils.py
├── test_reward_models.py
└── run_general_fsdp_qwenvl.sh
```

## Quick Start

Set the data and model paths, then run the entry script:

```bash
export DATA_PATH=/path/to/geo3k
export PRETRAIN_PATH=/path/to/Qwen2.5-VL-7B-Instruct
export REWARD_PRETRAIN_PATHS='{"general":"/path/to/general-reward-model"}'
bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh
```

All three paths are passed through the environment variables above. `REWARD_PRETRAIN_PATHS` is a JSON object that maps the reward model name (here `general`) to its checkpoint path.
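
A quick pre-flight check of these variables can catch a malformed `REWARD_PRETRAIN_PATHS` before a long job starts. The following is a minimal sketch, assuming only what the block above shows (three variables, one of them JSON with a `general` key). `load_demo_paths` is a hypothetical helper and is not part of the demo; how the entry script itself consumes these variables is not shown here.

```python
import json
import os


def load_demo_paths(env=os.environ):
    """Read and validate the three Quick Start variables (hypothetical helper)."""
    data_path = env["DATA_PATH"]
    pretrain_path = env["PRETRAIN_PATH"]
    # REWARD_PRETRAIN_PATHS is a JSON object: reward model name -> checkpoint path.
    reward_paths = json.loads(env["REWARD_PRETRAIN_PATHS"])
    if not isinstance(reward_paths, dict) or "general" not in reward_paths:
        raise ValueError('REWARD_PRETRAIN_PATHS must be a JSON object with a "general" key')
    return data_path, pretrain_path, reward_paths


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "DATA_PATH": "/path/to/geo3k",
    "PRETRAIN_PATH": "/path/to/Qwen2.5-VL-7B-Instruct",
    "REWARD_PRETRAIN_PATHS": '{"general":"/path/to/general-reward-model"}',
}
print(load_demo_paths(example_env))
```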

## Results

### Experiment Setup

This demo has been validated with one full training run on 2 GPUs (W&B: [ORM-RL-Demo-QwenVL-7B-Geo3K](https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/zrekazyw)):

| Item | Value |
| --- | --- |
| Actor | Qwen2.5-VL-7B-Instruct |
| General RM | Qwen2.5-VL-7B general reward model |
| Dataset | Geo3K |
| Training engine | FSDP |
| Inference engine | SGLang (`rm_use_engine=True`) |
| Reward mixing | `format_reward × 0.1 + general_model_reward × 0.2 + accuracy_reward × 0.7` |
| Batch sizes | `train_batch_size=128`, `rollout_batch_size=128` |
| Sampling | `n_samples_per_prompt=8`, `num_episodes=20` |
| Sequence length | `prompt_max_len=1024`, `generate_max_len=2048` |
| Optimizer / KL | `actor_learning_rate=1e-6`, `init_kl_coef=0.001`, `lr_warmup_ratio=0.03` |

Three reward components are combined with the weights above to form the reward that drives the GRPO loss: the rule-based accuracy reward (`accuracy_reward`, weight 0.7), the ORM score (`general_model_reward`, weight 0.2), and the format reward (`format_reward`, weight 0.1). The raw ORM output is one of 0.0, 0.5, or 1.0. The `general_model_reward` values reported in the metrics and case studies (e.g. `0.2`) are that output multiplied by the 0.2 coefficient, not the raw model score.
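
The weighted sum can be written out as a small function. This is a sketch of the formula in the table only, not the demo's reward code; `mix_rewards` and its argument names are hypothetical, and the ORM argument is taken to be the raw 0.0 / 0.5 / 1.0 score before weighting.

```python
# Weights copied from the "Reward mixing" row of the setup table.
WEIGHTS = {
    "format_reward": 0.1,
    "general_model_reward": 0.2,
    "accuracy_reward": 0.7,
}


def mix_rewards(format_reward, orm_raw_score, accuracy_reward):
    """Return each weighted contribution plus their sum (hypothetical helper)."""
    contributions = {
        "format_reward": WEIGHTS["format_reward"] * format_reward,
        "general_model_reward": WEIGHTS["general_model_reward"] * orm_raw_score,
        "accuracy_reward": WEIGHTS["accuracy_reward"] * accuracy_reward,
    }
    # The right-hand side is evaluated before "total" is added to the dict.
    contributions["total"] = sum(contributions.values())
    return contributions


# Well-formatted answer, raw ORM score 1.0, rule-based check failed.
example = mix_rewards(format_reward=1.0, orm_raw_score=1.0, accuracy_reward=0.0)
print({name: round(value, 4) for name, value in example.items()})
```

With these inputs the ORM contributes 0.2 and the format reward 0.1, so a response that the rule-based check rejects can still earn a total of about 0.3.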

### Curve Results

The run completed successfully (`train/global_step=320`, 16 eval passes):
- `eval/reward_mean` improved from `0.4636` to `0.5679`
- Best `eval/reward_mean=0.5686` at `train_step=260`
- Final `eval/accuracy_reward_mean=0.5166`, `eval/format_reward_mean=0.9956`, `eval/general_model_reward_mean=0.1067`
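
As a consistency check, the final `eval/reward_mean` can be recomputed from the three component means. The sketch below assumes that `eval/general_model_reward_mean` is already multiplied by its 0.2 coefficient while the format and accuracy means are unweighted; the variable names are illustrative.

```python
# Final eval means reported above.
format_mean = 0.9956            # unweighted
accuracy_mean = 0.5166          # unweighted
general_weighted_mean = 0.1067  # assumed to include the 0.2 coefficient already
reported_reward_mean = 0.5679

# Apply the 0.1 and 0.7 weights, then add the already weighted ORM term.
reconstructed = 0.1 * format_mean + 0.7 * accuracy_mean + general_weighted_mean
difference = abs(reconstructed - reported_reward_mean)

print(f"reconstructed={reconstructed:.4f} reported={reported_reward_mean:.4f}")
print(f"absolute difference={difference:.5f}")
```

The recomputed value agrees with the reported mean to within rounding of the four-decimal inputs, which supports reading the logged ORM metric as the weighted contribution.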

![](assets/exp_20260417/summary_card.png)

![](assets/exp_20260417/reward_dashboard.png)

![](assets/exp_20260417/optimization_dashboard.png)

### Case Study

Two questions appear in the saved trajectories at both step 80 and step 320. The comparisons below show how the actor's answers to these two questions changed between early and late training.

#### Question A: Parallelogram Area

![](assets/exp_20260417/question_a_step80.png)

![](assets/exp_20260417/question_a_step320.png)

- Step 80 rewards: `total=0.3`, `format=1.0`, `accuracy=0.0`, `general_model=0.2`, `rule=0.1`
- Step 320 rewards: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- The actor already produced a close answer at step 80, so the ORM scored it 1.0 (contributing 0.2 after the 0.2 coefficient); by step 320 the output moved from `38.97` to the rule-matching `39.0`, flipping `accuracy_reward` from `0.0` to `1.0`.

#### Question B: Tangent Geometry `y`

![](assets/exp_20260417/question_b_step80.png)

![](assets/exp_20260417/question_b_step320.png)

- Step 80 rewards: `total=0.1`, `format=1.0`, `accuracy=0.0`, `general_model=0.0`, `rule=0.1`
- Step 320 rewards: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- At step 80 only the format was correct, and neither the accuracy rule nor the ORM rewarded the answer; by step 320 both contributed positively.
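
The `rule` field in the reward lines above is not defined in this README. The logged numbers are consistent with reading it as the weighted rule-based part, `0.1 × format + 0.7 × accuracy`, with `total = rule + general_model`. The sketch below checks that reading against the four logged rows; it is an inference from the numbers, not a statement about the trainer's code.

```python
import math

# (label, format, accuracy, general_model (already weighted), rule, total)
cases = [
    ("Question A, step 80", 1.0, 0.0, 0.2, 0.1, 0.3),
    ("Question A, step 320", 1.0, 1.0, 0.2, 0.8, 1.0),
    ("Question B, step 80", 1.0, 0.0, 0.0, 0.1, 0.1),
    ("Question B, step 320", 1.0, 1.0, 0.2, 0.8, 1.0),
]


def is_consistent(fmt, acc, general, rule, total):
    """Check the assumed decomposition of one logged reward row."""
    rule_ok = math.isclose(0.1 * fmt + 0.7 * acc, rule, abs_tol=1e-9)
    total_ok = math.isclose(rule + general, total, abs_tol=1e-9)
    return rule_ok and total_ok


results = {label: is_consistent(*values) for label, *values in cases}
for label, ok in results.items():
    print(label, "consistent" if ok else "inconsistent")
```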

## License

This project is licensed under the Apache 2.0 License. See [LICENSE](../../LICENSE) for details.
105 changes: 105 additions & 0 deletions examples/orm_rl_demo/README_zh.md
@@ -0,0 +1,105 @@
<div align="center">

# ORM RL Demo Training Example

A complete ORM trajectory-scoring RL training demo based on Geo3K.

</div>

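## Overview

This example shows the full pipeline for RL training in which an ORM scores trajectories, with the following configuration:
- dataset: Geo3K
- actor: Qwen2.5-VL 7B model
- reward: a single general outcome reward model, combined with a rule-based accuracy reward and a format reward to compute the GRPO loss
- training engine: FSDP; inference engine: SGLang

During training, the actor generates trajectories on Geo3K and the general ORM scores them. The score is mixed with the rule-based accuracy reward (`accuracy_reward`) and the format reward (`format_reward`), and the three together are used to compute the GRPO loss. To avoid rewriting the Geo3K dataset files, the demo overrides the data label to `geo3k_general` at runtime, so the original data path is reused while the samples go through the general ORM reward mixing logic.

The environment requirements are the same as those in the repository-level [README_zh.md](../../README_zh.md#环境要求); refer to the main document directly.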

## Project Structure

```text
orm_rl_demo/
├── train_colocate.py
├── reward_models.py
├── reward_models_utils.py
├── test_reward_models.py
└── run_general_fsdp_qwenvl.sh
```

## Quick Start

Set the data and model paths, then run the entry script:

```bash
export DATA_PATH=/path/to/geo3k
export PRETRAIN_PATH=/path/to/Qwen2.5-VL-7B-Instruct
export REWARD_PRETRAIN_PATHS='{"general":"/path/to/general-reward-model"}'
bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh
```

Before running, specify the data and model paths through the environment variables above.

## Results

### Experiment Setup

This demo has been validated with one real full training run on 2 GPUs (W&B run: [ORM-RL-Demo-QwenVL-7B-Geo3K](https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/zrekazyw)). The key configuration is as follows:

| Item | Value |
| --- | --- |
| Actor | Qwen2.5-VL-7B-Instruct |
| General RM | Qwen2.5-VL-7B general reward model |
| Dataset | Geo3K |
| Training engine | FSDP |
| Inference engine | SGLang (`rm_use_engine=True`) |
| Reward mixing | `format_reward × 0.1 + general_model_reward × 0.2 + accuracy_reward × 0.7` |
| Batch sizes | `train_batch_size=128`, `rollout_batch_size=128` |
| Sampling | `n_samples_per_prompt=8`, `num_episodes=20` |
| Sequence length | `prompt_max_len=1024`, `generate_max_len=2048` |
| Optimizer / KL | `actor_learning_rate=1e-6`, `init_kl_coef=0.001`, `lr_warmup_ratio=0.03` |

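The three rewards (the rule-based accuracy reward `accuracy_reward`, the ORM score `general_model_reward` with coefficient 0.2, and the format reward `format_reward`) are mixed with the weights above and together determine the final GRPO loss. The 0.2 attached to `general_model_reward` is the weight coefficient: the ORM itself outputs 0.0, 0.5, or 1.0, and multiplying that output by 0.2 gives its reward contribution.

### Overall Curve Results

The training run completed in full (`train/global_step=320`, 16 eval passes in total):
- `eval/reward_mean` improved from `0.4636` to `0.5679`
- Best `eval/reward_mean=0.5686`, reached at `train_step=260`
- Final `eval/accuracy_reward_mean=0.5166`, `eval/format_reward_mean=0.9956`, `eval/general_model_reward_mean=0.1067`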

![](assets/exp_20260417/summary_card.png)

![](assets/exp_20260417/reward_dashboard.png)

![](assets/exp_20260417/optimization_dashboard.png)

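### Case Study

Two questions appear at both step 80 and step 320. The following shows the actual comparison for these two questions from early to late training.

#### Question A: Parallelogram Area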

![](assets/exp_20260417/question_a_step80.png)

![](assets/exp_20260417/question_a_step320.png)

- Step 80 rewards: `total=0.3`, `format=1.0`, `accuracy=0.0`, `general_model=0.2`, `rule=0.1`
- Step 320 rewards: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- Interpretation: at step 80 the actor's answer was already very close, so the ORM scored it 1.0, contributing 0.2 after the 0.2 coefficient; by step 320 the output was corrected from `38.97` to the rule answer `39.0`, and `accuracy_reward` jumped from `0.0` to `1.0`.

#### Question B: Tangent Geometry `y`

![](assets/exp_20260417/question_b_step80.png)

![](assets/exp_20260417/question_b_step320.png)

- Step 80 rewards: `total=0.1`, `format=1.0`, `accuracy=0.0`, `general_model=0.0`, `rule=0.1`
- Step 320 rewards: `total=1.0`, `format=1.0`, `accuracy=1.0`, `general_model=0.2`, `rule=0.8`
- Interpretation: at step 80 only the format was preserved, and neither accuracy nor the general RM gave any score; by step 320 both had become positive contributions.

## License

This project is licensed under the Apache 2.0 License. See [LICENSE](../../LICENSE) for details.