feature(zsa): add a minimal general ORM RL example on Geo3K by HansBug · Pull Request #56 · opendilab/LightRFT

HansBug · 2026-04-09T06:12:46Z

Summary

This draft PR adds a new orm_rl_demo example / experiment for minimal ORM-based RL training in LightRFT.

The target example is a single runnable path similar to:

bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh

The intended experiment is:

dataset: Geo3K
base actor: Qwen2.5-VL-7B
outcome reward model: Qwen2.5-VL-72B general ORM
goal: clarify the core ORM RL workflow with a small end-to-end setup that is easier to understand, easier to run, and easier to debug

This PR remains draft because final follow-up review is still pending.

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🎨 Refactoring (code style, formatting, local variables)
⚡ Performance (improvements to code performance)
✅ Testing (adding or fixing tests)
📚 Documentation (updates to documentation)
💥 Breaking change (fix or feature that causes existing functionality to fail)

Related Issues

Fixes #
Related to #

What This Draft Adds

This draft is intended to add one focused example / experiment rather than a broader multi-purpose training surface.

The intended end state is:

one minimal example path under examples/orm_rl_demo
one Geo3K-oriented general ORM RL entrypoint
one base Qwen2.5-VL-7B actor for trajectory generation
one Qwen2.5-VL-72B general outcome reward model for trajectory scoring
one smaller review surface for understanding the ORM RL loop clearly

Testing

Environment:

Python: local runtime validation completed for the narrowed demo
PyTorch: local runtime validation completed for the narrowed demo
CUDA: local runtime validation completed for the narrowed demo

Command(s):

bash examples/orm_rl_demo/run_general_fsdp_qwenvl.sh

Results:

Tests passed locally

Integration Notes

This draft branch has already been synced with the latest upstream main.

While finishing the example cleanup, the main runtime areas worth paying attention to are still:

rollout metadata flow through experience_maker_vl and trajectory_saver
FSDP / engine weight synchronization in broadcast_utils
multimodal generation flow in fast_exp_maker and trainer/utils.py
reward-side response extraction / prompt-contamination risks already addressed upstream in the gsm8k_geo3k rule-reward path

TODO Before Marking This Ready

Checklist

HansBug · 2026-04-17T06:10:30Z

Here is a full training validation report based on the current PR code. The run used is:

Report format follows: upstream PR54 doc(pu): add grpo_gsm8k_geo3k_tutorial doc #54
wandb: https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/pcwonr2h
run name: ORM-RL-Demo-Geo3K-General-04161630
commit: a8d338c
state: finished

1. Run configuration for this validation

This was one complete, long 2-GPU training validation launched through a real rlaunch job on the current evaluation image. The core configuration:

resources: 2 GPU / 40 CPU / 500000 memory
image: registry.h.pjlab.org.cn/ailab-rlinfra-rlinfra_gpu/easyr1:lightrft-20260119
data: /mnt/shared-storage-user/puyuan/data/geo3k
actor: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
general RM: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
rollout engine: vllm
RM inference: rm_use_engine=True, backend is vllm
label override: geo3k_general
reward mixing: format/general_model/accuracy = 0.1 / 0.2 / 0.7
num_episodes=20
train_batch_size=128
rollout_batch_size=128
micro_train_batch_size=4
micro_rollout_batch_size=4
n_samples_per_prompt=8
prompt_max_len=1024
generate_max_len=2048
actor_learning_rate=1e-6
init_kl_coef=0.001
lr_warmup_ratio=0.03
eval_steps=20
max_eval_samples=700
zero_stage=3
bf16=True
gradient_checkpointing=True
freeze_prefix=True
adam_offload=True
flash_attn=True
save_trajectories=True, max_ckpt_num=1

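2. Training results overview

This run completed in full and reached:

train/global_step = 320
eval was triggered 16 times, at train_step = 20, 40, ..., 320

The wandb results show a normal overall trend. The main findings:

eval/reward_mean rose from 0.4587 to 0.5736, an absolute gain of +0.1149 and a relative gain of about +25.0%
the best eval/reward_mean was 0.5793, at train_step=240
the final value is only 0.0057 below the best, so the second half of training essentially plateaued rather than collapsing
eval/accuracy_reward_mean rose from 0.3936 to 0.5225 and is the main source of the gain
eval/format_reward_mean was around 0.99 from the start and stayed stable, so the format constraint converged early
eval/general_model_reward_mean rose from 0.0842 to 0.1086, a steady positive gain rather than 0
train/general_model_reward_mean rose from 0.0600 to 0.1488
train/accuracy_reward_mean rose from 0.2734 to 0.7168
train/format_reward_mean rose from 0.5313 to 0.9844
train/response_length_mean fell from 333.5 to 276.5; eval/response_length_mean settled at about 268.8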
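The headline deltas in this list can be re-derived from the reported endpoints. A minimal sketch, using only numbers quoted in this comment (the variable names are illustrative, not taken from the PR):

```python
# Endpoints quoted in the report (eval/reward_mean).
first_eval = 0.4587
final_eval = 0.5736
best_eval = 0.5793

abs_gain = final_eval - first_eval       # absolute improvement
rel_gain = abs_gain / first_eval         # relative improvement
drop_from_best = best_eval - final_eval  # how far the final value sits below the best

# Evaluation fires every eval_steps training steps.
global_steps, eval_steps = 320, 20
num_evals = global_steps // eval_steps

print(f"absolute gain: {abs_gain:+.4f}")
print(f"relative gain: {rel_gain:+.1%}")
print(f"final vs best: {drop_from_best:.4f}")
print(f"eval count:    {num_evals}")
```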

Under the current reward recipe, the reward structure of this run is clear:

rule_reward_mean = 0.1 * format_reward + 0.7 * accuracy_reward
step_reward_mean = rule_reward_mean + 0.2 * general_model_reward

So the rise in total reward is driven mainly by accuracy_reward; general_model_reward adds a further positive contribution, and format_reward mostly holds steady at a high level.
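The recipe can be written as a small function. This is a sketch only: `mix_reward` and its argument names are hypothetical, not the PR's actual `mix_rewards` implementation, and it assumes `orm_score` is the raw ORM output in [0, 1] before the 0.2 weight is applied.

```python
# Hypothetical helper illustrating the 0.1 / 0.2 / 0.7 recipe from this comment.
FORMAT_W, GENERAL_W, ACCURACY_W = 0.1, 0.2, 0.7


def mix_reward(format_reward: float, orm_score: float, accuracy_reward: float) -> dict:
    """Return the rule part, the weighted ORM part, and their sum."""
    rule = FORMAT_W * format_reward + ACCURACY_W * accuracy_reward
    general = GENERAL_W * orm_score
    return {"rule": rule, "general_model": general, "total": rule + general}


# A fully correct sample earns the whole budget; a wrong answer that the ORM
# also rejects keeps only the format share.
print(mix_reward(1.0, 1.0, 1.0))
print(mix_reward(1.0, 0.0, 0.0))
```

Because the three weights sum to 1.0, the total stays in [0, 1] whenever each component does.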

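3. Additional notes on the optimization signal

the final train/kl is 0.4442
the raw history contains a small number of KL spikes; the raw max is 216.2680
to keep the extreme points from flattening the main trend, the KL plot below uses p99 clipping
judging by the main trend, training was not out of control throughout; it looks more like a few spikes in the middle, after which KL returned to an acceptable range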
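One way to do this kind of p99 clipping, as a stdlib-only sketch. The nearest-rank percentile method and the synthetic `kl_history` values are assumptions for illustration; the actual plotting script is not shown in this thread.

```python
import math


def clip_to_percentile(values, pct=99.0):
    """Cap every value at the nearest-rank `pct` percentile of the series."""
    ordered = sorted(values)
    # Nearest-rank method: smallest value with at least pct% of the data at or below it.
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    cap = ordered[rank - 1]
    return [min(v, cap) for v in values], cap


# Synthetic series: mostly small KL values plus one extreme spike.
kl_history = [0.4 + 0.001 * i for i in range(199)] + [216.268]
clipped, cap = clip_to_percentile(kl_history)
print(f"cap={cap:.3f}, raw max={max(kl_history):.3f}, clipped max={max(clipped):.3f}")
```

Clipping only changes what is drawn; the raw maximum is still worth reporting alongside the plot, as done in the list above.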

4. Key curves

Summary Card

Reward Dashboard

Optimization Dashboard

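5. Current conclusions

Based on this complete run, my assessment of the current PR code is:

the Geo3K + general ORM RL demo pipeline now runs end to end in a real 2-GPU rlaunch environment
the current default of using vllm as the rollout / RM engine backend has been validated
the results show training did more than merely start: accuracy reward, general model reward, and total eval reward all improved clearly
format reward was near saturation early, so the later gains come mainly from accuracy and partly from the general RM, not from format
under the current image/runtime constraints, this demo is at least in a deliverable state: it runs, converges, and produces interpretable wandb metrics

If needed later, I can add:

a comparison table against a rule-only Geo3K baseline
comparison experiments with different reward mixing weights
further findings from investigating the KL spikes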

HansBug · 2026-04-18T03:06:15Z

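Here is a complete experiment report for the real long training run with full SGLang + RM engine, based on this PR's current code. Report format follows PR54: #54

1. The real run behind this experiment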

W&B run: https://wandb.ai/hansbug/ORM-RL-Demo-QwenVL-7B-Geo3K/runs/zrekazyw
run name: ORM-RL-Demo-Geo3K-General-SGLang-20260417_150451
state: finished
worker launch script: /mnt/shared-storage-user/zhangshaoang/.orm_rl_demo_full_sglang_20260417.sh
raw train log: /mnt/shared-storage-user/zhangshaoang/.orm_rl_demo_full_sglang_20260417_150345.log
save dir: /mnt/shared-storage-user/zhangshaoang/LightRFT/results/orm-rl-demo-general-geo3k-sglang/LightRFT-geo3k-general-orm-sglang-len_1024_2048-tbs_128-rbs_128-sample_8-kl_0.001-warmup_0.03-ep_20-lr_1e-6-20260417_150451
trajectory dir: /mnt/shared-storage-user/zhangshaoang/LightRFT/results/orm-rl-demo-general-geo3k-sglang/LightRFT-geo3k-general-orm-sglang-len_1024_2048-tbs_128-rbs_128-sample_8-kl_0.001-warmup_0.03-ep_20-lr_1e-6-20260417_150451/trajectories

2. Key launch configuration

This was not a local smoke test but a real 2-GPU long training run launched with rlaunch. The core configuration:

resources: 2 GPU / 40 CPU / 500000 memory
image: registry.h.pjlab.org.cn/ailab-rlinfra-rlinfra_gpu/easyr1:lightrft-20260119
Conda environment: /root/miniconda3/envs/lightrft
actor: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
general RM: /mnt/shared-storage-user/puyuan/model/Qwen2.5-VL-7B-Instruct
data: /mnt/shared-storage-user/puyuan/data/geo3k
rollout engine: sglang
RM: rm_use_engine=True, backend is sglang
reward mixing: format 0.1 + general_model 0.2 + accuracy 0.7
train_batch_size=128, rollout_batch_size=128
micro_train_batch_size=4, micro_rollout_batch_size=4
n_samples_per_prompt=8, num_episodes=20
prompt_max_len=1024, generate_max_len=2048
actor_learning_rate=1e-6, init_kl_coef=0.001, lr_warmup_ratio=0.03
max_ckpt_num=1, save_trajectories=True, num_trajectories_to_save=16

In addition, before starting training the worker explicitly set up the runtime environment that sglang needs. The essentials:

conda activate /root/miniconda3/envs/lightrft
PYTHONPATH=/mnt/shared-storage-user/zhangshaoang/LightRFT:$PYTHONPATH
LD_LIBRARY_PATH additionally includes:
- /usr/local/nvidia/lib
- /usr/local/nvidia/lib64
- /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cuda_runtime/lib
- /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cudnn/lib
- /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cublas/lib
- /root/miniconda3/envs/lightrft/lib/python3.12/site-packages/nvidia/cuda_nvrtc/lib
- /root/miniconda3/envs/lightrft/lib
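The same environment setup could be expressed in Python, for example when launching the trainer through `subprocess`. This is a sketch: the helper name `build_sglang_env` is hypothetical, the paths are the ones listed in this comment, and it assumes the extra entries are prepended ahead of any existing value.

```python
import os

REPO = "/mnt/shared-storage-user/zhangshaoang/LightRFT"
ENV_LIB = "/root/miniconda3/envs/lightrft/lib"
SITE = f"{ENV_LIB}/python3.12/site-packages/nvidia"
EXTRA_LIB_DIRS = [
    "/usr/local/nvidia/lib",
    "/usr/local/nvidia/lib64",
    f"{SITE}/cuda_runtime/lib",
    f"{SITE}/cudnn/lib",
    f"{SITE}/cublas/lib",
    f"{SITE}/cuda_nvrtc/lib",
    ENV_LIB,
]


def build_sglang_env(base=None):
    """Return a copy of `base` with PYTHONPATH and LD_LIBRARY_PATH extended."""
    env = dict(os.environ if base is None else base)

    def prepend(key, parts):
        # Keep existing entries, dropping empty segments.
        existing = [p for p in env.get(key, "").split(os.pathsep) if p]
        env[key] = os.pathsep.join(list(parts) + existing)

    prepend("PYTHONPATH", [REPO])
    prepend("LD_LIBRARY_PATH", EXTRA_LIB_DIRS)
    return env


env = build_sglang_env({"LD_LIBRARY_PATH": "/opt/lib"})
print(env["LD_LIBRARY_PATH"].split(os.pathsep)[-1])
```

The returned dict can be passed as `env=` to `subprocess.run` so the parent process environment stays untouched.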

3. Core results and conclusions

training ran to completion, with final train/global_step = 320
eval was triggered 16 times in total
eval/reward_mean rose from 0.4636 to 0.5679
best eval/reward_mean = 0.5686, at train_step = 260
final eval/accuracy_reward_mean = 0.5166
final eval/format_reward_mean = 0.9956
final eval/general_model_reward_mean = 0.1067
final train/general_model_reward_mean = 0.1309
final train/step_reward_mean = 0.6883
final train/kl = 0.5952

My assessment of this run:

With the current PR code in a real 2-GPU rlaunch environment, the Geo3K + ORM RL demo + SGLang rollout + SGLang RM engine pipeline now runs end to end.
It does more than start up: the reward curves look normal overall, and both accuracy reward and general_model_reward improved.
format_reward was near saturation very early; the later gain in total reward comes mainly from accuracy, with general_model_reward contributing extra points.
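As a consistency check, the final eval numbers for this run can be recombined. This sketch assumes, as the later metrics-normalization fix in this thread describes, that at this commit `general_model_reward` was logged as the already-weighted contribution (0.2 × ORM score), so it is added without a further 0.2 factor.

```python
# Final eval means quoted in this comment for the SGLang run.
format_mean = 0.9956
accuracy_mean = 0.5166
general_model_logged = 0.1067  # assumed to already include the 0.2 weight
reported_reward_mean = 0.5679

# The mix is linear, so it can be applied to the means directly.
rule_part = 0.1 * format_mean + 0.7 * accuracy_mean
recombined = rule_part + general_model_logged

print(f"rule part:  {rule_part:.4f}")
print(f"recombined: {recombined:.4f} (reported {reported_reward_mean})")
```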

4. Figures

Summary Card

Reward Dashboard

Optimization Dashboard

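5. Three sample groups drawn from real trajectories

None of the three groups below is a hand-written illustration; all were taken directly from the trajectory files of this real run. They cover:

a correct sample from the final stage
a sample that received partial positive credit from the general RM while accuracy was still 0
a failure sample that only passed format and earned no other reward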

Case A: Final-step correct sample

Source: trajectories_step_320.json, idx=0, image images/step320_exp0_sample0_img0.png
Prompt: Find the area of the parallelogram. Round to the nearest tenth if necessary.
Output excerpt: ... The area of the parallelogram is approximately \boxed{39.0}.
Reward breakdown: total=1.0, format=1.0, accuracy=1.0, general_model=0.2, rule=0.8

Case B: Partial reward from RM support

Source: trajectories_step_80.json, idx=0, image images/step80_exp0_sample0_img0.png
Prompt: Find the area of the parallelogram. Round to the nearest tenth if necessary.
Output excerpt: ... The area of the parallelogram is approximately 38.97 square feet. \boxed{38.97}
Reward breakdown: total=0.3, format=1.0, accuracy=0.0, general_model=0.2, rule=0.1
This case is typical: the answer is very close to the correct value but does not match the accuracy rule, so the total reward comes mainly from format (0.1) + general_model (0.2).

Case C: Format-only failure case

Source: trajectories_step_160.json, idx=8, image images/step160_exp8_sample0_img0.png
Prompt: Find y. Assume that segments that appear to be tangent are tangent. Round to the nearest tenth if necessary.
Output excerpt: ... After calculating, we find that y = 10. </think> The radius y is \boxed{10}.
Reward breakdown: total=0.1, format=1.0, accuracy=0.0, general_model=0.0, rule=0.1
This case shows that the floor of the current reward mix is the format reward: if the answer is wrong and the general RM gives no credit, the total reward stays at 0.1.

If needed later, I can add a rule-only baseline comparison based on this full run, or collect finer-grained rollout/train/eval log statistics from this run into a separate appendix comment.

HansBug · 2026-04-18T03:25:58Z

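A correction note on how the real question cases are presented.

The samples in the previous experiment report comment were chosen by "reward pattern", but a more direct presentation is to fix the same question and compare the inference output and reward between step80 and step320.

I re-screened the real trajectories from this full run. Only 2 questions are shared between step80 and step320, which makes exactly 4 cards: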

Question A @ step80
Question A @ step320
Question B @ step80
Question B @ step320

All 4 come directly from the real trajectories:

Question A step80: trajectories_step_80.json, idx=0
Question A step320: trajectories_step_320.json, idx=0
Question B step80: trajectories_step_80.json, idx=8
Question B step320: trajectories_step_320.json, idx=8
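A sketch of how questions shared between two saved steps could be found. The trajectory JSON schema is not shown in this thread, so the record layout here (a list of dicts with a `prompt` field) and the function name `common_prompts` are assumptions; in-memory lists stand in for `trajectories_step_80.json` and `trajectories_step_320.json`.

```python
def common_prompts(step_a, step_b):
    """Map each prompt present in both steps to its (idx_a, idx_b) pair."""
    first_idx_b = {}
    for idx, record in enumerate(step_b):
        # Remember only the first occurrence of each prompt.
        first_idx_b.setdefault(record["prompt"], idx)
    shared = {}
    for idx, record in enumerate(step_a):
        prompt = record["prompt"]
        if prompt in first_idx_b and prompt not in shared:
            shared[prompt] = (idx, first_idx_b[prompt])
    return shared


# Toy stand-ins; real files would be read with json.load.
step80 = [{"prompt": "Find the area of the parallelogram."}, {"prompt": "Find x."}]
step320 = [{"prompt": "Find the area of the parallelogram."}, {"prompt": "Find z."}]

for prompt, (i, j) in common_prompts(step80, step320).items():
    print(f"step80 idx={i} <-> step320 idx={j}: {prompt}")
```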

Question A: parallelogram area

Shared prompt: Find the area of the parallelogram. Round to the nearest tenth if necessary.
Step 80 output: ... 38.97 square feet. \boxed{38.97}
Step 320 output: ... \boxed{39.0}
Step 80 rewards: total=0.3, format=1.0, accuracy=0.0, general_model=0.2, rule=0.1
Step 320 rewards: total=1.0, format=1.0, accuracy=1.0, general_model=0.2, rule=0.8

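The change here is typical: at step80 the output was already "very close to the correct answer", so general_model_reward is positive, but because it did not match the rule answer, accuracy_reward is still 0. By step320 the output was corrected from 38.97 to the rule answer 39.0, so the total reward jumps from 0.3 straight to 1.0.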

Question B: tangent geometry y

Shared prompt: Find y. Assume that segments that appear to be tangent are tangent. Round to the nearest tenth if necessary.
Step 80 output: ... The radius y is \boxed{10}.
Step 320 output: ... \boxed{12.6}.
Step 80 rewards: total=0.1, format=1.0, accuracy=0.0, general_model=0.0, rule=0.1
Step 320 rewards: total=1.0, format=1.0, accuracy=1.0, general_model=0.2, rule=0.8

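The change on this question is sharper: at step80 the model basically only kept the format, and neither accuracy nor the general RM gave any credit; at step320 the answer became a complete, correct solution, so both accuracy_reward and general_model_reward turned positive.

I also updated the READMEs accordingly:

PR comments keep using GitHub attachment links so they read well directly on the page
examples/orm_rl_demo/README.md
examples/orm_rl_demo/README_zh.md

All images in these two READMEs now use in-repo relative paths, with the assets stored in:

examples/orm_rl_demo/assets/verified_full_run_20260417/

This way the docs themselves do not depend on external attachment links.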

puyuan1996 · 2026-04-24T10:19:23Z

+Geo3K-specific reward mixing logic that combines format, general-model, and
+accuracy rewards during training and evaluation.
+"""
+from __future__ import annotations


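Anything that already exists in https://github.com/opendilab/LightRFT/blob/main/examples/gsm8k_geo3k/reward_models_utils.py should be imported directly from gsm8k_geo3k/reward_models_utils.py; only what is newly added for orm_rl_demo should go in this file.

This involves a code refactor, so it is skipped in this PR and will be handled separately later.

Details on the current state and assessment:

Actual comparison of the overlapping functions

The pure rule functions extract_response, geo3k_accuracy_reward_fn, geo3k_format_reward_fn, gsm8k_accuracy_reward_fn, and gsm8k_format_reward_fn are indeed nearly identical in implementation, with only minor differences in docstring wording.

However, the three core functions mix_rewards, reward_fn, and load_reward_models differ substantially: the gsm8k_geo3k version is a pure rule path (model_reward_list is always empty), while the orm_rl_demo version has to fuse scores from a real neural RM, so the two are not interchangeable.

What it would actually take

The scripts under examples/ are not a Python package (no __init__.py), so a cross-directory import can only be done with sys.path.insert, which introduces an implicit dependency between the two examples. That route is not clean to begin with.

The clean approach is to move the shared pure utility functions (accuracy/format reward fns, extract_response, etc.) into a submodule of the lightrft/ package (for example lightrft/utils/reward_utils.py) and have both examples import from lightrft. But that means changing the lightrft package itself, adding tests, and updating imports in both examples, which is a larger scope than touching only the example.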
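To make the trade-off concrete, here is a self-contained sketch of the `sys.path.insert` route. It builds a throwaway directory layout in a temp folder; the module name `shared_reward_utils` and the toy `format_reward` function are hypothetical stand-ins, not the real `reward_models_utils.py`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a fake "examples/gsm8k_geo3k" directory that is not a package.
root = Path(tempfile.mkdtemp())
other_example = root / "examples" / "gsm8k_geo3k"
other_example.mkdir(parents=True)
(other_example / "shared_reward_utils.py").write_text(
    "def format_reward(text):\n"
    "    return 1.0 if 'boxed{' in text else 0.0\n"
)

# The implicit dependency: the importing script must know the sibling's path
# and mutate sys.path before the import can succeed.
sys.path.insert(0, str(other_example))
shared = importlib.import_module("shared_reward_utils")

print(shared.format_reward(r"The answer is \boxed{39.0}."))
```

Nothing in the importing file declares this dependency except the path manipulation, which is why moving shared helpers into the installed package is the cleaner long-term option.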

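Current cost and risk assessment

The two examples are intentionally designed to be relatively independent. The reward_models_utils.py in orm_rl_demo contains demo-specific logic such as RM loading, engine configuration, and reward fusion, and is not functionally symmetric with the rule-only gsm8k_geo3k version; forcing a merge could make both harder to maintain.

I suggest keeping the status quo here. If the reward utility architecture is to be unified later, a better way is to open a separate issue to design it (which functions should go into the lightrft/ package, and what the interfaces should be), rather than bundling a riskier change into this PR.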

puyuan1996 · 2026-04-24T10:21:30Z

+from reward_models_utils import load_reward_models, reward_fn, RECIPE
+
+
+def _apply_label_override(dataset, label_key: str, label_override: str, strategy, dataset_name: str):


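Apart from _apply_label_override, could the def train(args): below reuse the one at https://github.com/opendilab/LightRFT/blob/main/examples/gsm8k_geo3k/train_colocate.py#L68? Please look at how to keep the code concise and extensible.

This involves a code refactor (reusing the train function from gsm8k_geo3k), so it is skipped in this PR and will be handled separately later.

Details on the current state and assessment:

Where the two train() functions actually differ

The train() in orm_rl_demo and the train() in gsm8k_geo3k share the same overall structure. There are only two core differences:

After the prompt dataset and the eval dataset are loaded, each gets one _apply_label_override call. This is logic specific to orm_rl_demo (at runtime it overrides the geo3k label with geo3k_general so that samples take the general ORM reward fusion path); gsm8k_geo3k has no such need.

Some differences in the details of the critic FSDP branch and logging (the gsm8k version additionally has torch.multiprocessing.set_sharing_strategy, among others).

What reuse would actually take

The most direct way is to add an optional dataset_transform: Optional[Callable] = None parameter to the gsm8k_geo3k train(), with orm_rl_demo passing in a callable that wraps _apply_label_override. But that pushes an ORM-specific hook requirement into the simpler example, which is the wrong direction.

The other route is to extract the shared train() into an extensible base class inside the lightrft/ package, with each example subclassing it and overriding the data preprocessing step. That is a reasonable long-term direction, but it is a framework-level change whose cost and review risk are both larger than this PR itself.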
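The first option (an optional `dataset_transform` hook) could look roughly like this. Everything here is a simplified, hypothetical sketch: `train` is reduced to a dataset-loading stub, `load_dataset` and `override_labels` are invented stand-ins (the real `_apply_label_override` also takes `strategy` and `dataset_name`), and datasets are plain lists of dicts.

```python
from functools import partial
from typing import Callable, Dict, List, Optional

Dataset = List[Dict[str, str]]


def load_dataset(split: str) -> Dataset:
    """Stand-in loader returning two Geo3K-labelled samples."""
    return [{"prompt": f"{split}-q{i}", "label": "geo3k"} for i in range(2)]


def override_labels(dataset: Dataset, label_key: str, label_override: str) -> Dataset:
    """Return a copy of the dataset with every label replaced."""
    return [{**sample, label_key: label_override} for sample in dataset]


def train(dataset_transform: Optional[Callable[[Dataset], Dataset]] = None) -> Dict[str, Dataset]:
    """Shared entrypoint stub: load data, then apply the optional hook."""
    data = {"prompts": load_dataset("train"), "eval": load_dataset("test")}
    if dataset_transform is not None:
        data = {name: dataset_transform(ds) for name, ds in data.items()}
    return data  # a real train() would go on to build models and run training


# Rule-only example: no hook. ORM demo: route samples to geo3k_general.
rule_only = train()
orm_demo = train(partial(override_labels, label_key="label", label_override="geo3k_general"))
print(rule_only["prompts"][0]["label"], orm_demo["prompts"][0]["label"])
```

The hook keeps the default path unchanged for callers that pass nothing, which is the property that would let both examples share one entrypoint.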

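Current recommendation

The orm_rl_demo train() has already completed a full 2-GPU training run, and keeping it as an independent copy is the lowest-risk option for now. If the train() architecture of the examples is to be unified later (with support for a dataset transform hook), a better way is to open a separate issue to design it, rather than bundling a broader change into this PR.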

…x format

- Rename assets/verified_full_run_20260417 -> assets/exp_20260417
- Rewrite README_zh.md: user-facing tone, merge overview sections, restructure experiment results into 实验设置/整体曲线结果/案例分析 (experiment setup / overall curve results / case analysis), clarify reward formula (general_model coefficient 0.2 vs raw ORM output range)
- Rewrite README.md with equivalent English changes
- Convert all reward_models.py docstrings from Google Args:/Returns: style to Sphinx :param/:type/:return/:rtype format to match repo convention

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…ics to [0, 1]

Previously, the logged metrics were weighted contributions rather than raw scores:
- general_model_reward was logged as 0.2 × ORM_score, capping at 0.2
- rule_reward was logged as 0.1×format + 0.7×accuracy, ranging [0, 0.8]

The final_reward computation (used for training) is unchanged:
final = 0.1×fmt + 0.2×orm_score + 0.7×acc

Only the metrics dict values are corrected:
- general_model_reward now logs the raw ORM output {0.0, 0.5, 1.0}
- rule_reward now logs (0.1×fmt + 0.7×acc) / 0.8, normalizing to [0, 1] while preserving the relative weighting between format and accuracy

Verified with a 2-GPU run: metrics now fall in [0, 1] and final_reward values are unchanged from previous runs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

HansBug · 2026-04-27T07:27:39Z

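Fix: normalize the general_model_reward and rule_reward metrics to [0, 1]

Previously these two metrics logged the weighted contribution rather than the raw score, which made the values small and unintuitive:

| metric | before fix | range before fix | after fix | range after fix |
| --- | --- | --- | --- | --- |
| `general_model_reward` | `0.2 × ORM score` | [0, 0.2] | raw ORM output | {0, 0.5, 1.0} |
| `rule_reward` | `0.1×fmt + 0.7×acc` | [0, 0.8] | `(0.1×fmt + 0.7×acc) / 0.8` | [0, 1] |

The training computation is completely unaffected. The final_reward formula:

final = 0.1 × format + 0.2 × orm_score + 0.7 × accuracy

is unchanged; only the values recorded in metrics_dict were corrected (this dict is used only for logging and trajectory saving and does not take part in gradient computation).

Validation results (real 2-GPU run):

general_model_reward_mean: 0.274 ~ 0.399, clearly above the old upper bound of 0.2 ✅
rule_reward_mean: 0.306 ~ 0.445, within [0, 1] ✅
rollout_reward: 0.3 ~ 0.44, on the same level as before the fix, training unaffected ✅
cross-check: at step 6, (0.1×0.959 + 0.7×0.371)/0.8 = 0.445, matching the log exactly ✅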
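The before/after logging convention can be sketched as follows. `build_metrics` is a hypothetical name, not the PR's function; the point is that the training reward is computed identically in both modes and only the logged values change. The last lines redo the step 6 cross-check with the quoted means.

```python
def build_metrics(fmt: float, orm_score: float, acc: float, normalized: bool = True) -> dict:
    """Compute the training reward plus the values that get logged."""
    rule_weighted = 0.1 * fmt + 0.7 * acc
    final = rule_weighted + 0.2 * orm_score  # used for training, same in both modes
    if normalized:
        # After the fix: raw ORM score, rule part rescaled to [0, 1].
        logged = {"general_model_reward": orm_score, "rule_reward": rule_weighted / 0.8}
    else:
        # Before the fix: weighted contributions.
        logged = {"general_model_reward": 0.2 * orm_score, "rule_reward": rule_weighted}
    return {"final_reward": final, **logged}


old = build_metrics(1.0, 0.5, 1.0, normalized=False)
new = build_metrics(1.0, 0.5, 1.0, normalized=True)
print(old)
print(new)

# Step 6 cross-check using the logged means quoted in this comment.
step6_rule = (0.1 * 0.959 + 0.7 * 0.371) / 0.8
print(f"{step6_rule:.4f}")
```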

commit: cb2c20e

AltmanD

The related changes in the Chinese README need to be synced to the English README.

AltmanD · 2026-04-29T06:28:18Z

+                num_per_rank = len(texts) // model._tp_size
+                texts = texts[model._tp_rank * num_per_rank : (model._tp_rank + 1) * num_per_rank]
+        else:
+            from vllm import SamplingParams


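Please put all imports at the top of the file.

This is an intentional lazy import. vllm is not necessarily installed in every runtime environment (an sglang-only environment has no vllm), and a top-level import would raise ImportError in such environments and make the whole module unimportable. The current approach runs from vllm import SamplingParams only when the vllm branch is actually taken, which avoids a hard dependency. If needed later, a top-level try: from vllm import SamplingParams\nexcept ImportError: SamplingParams = None could be considered, but it would introduce extra None checks; on balance, keeping the lazy import is cleaner for now.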
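The lazy-import pattern in question, as a minimal sketch. `make_sampling_config` and the sglang-side dict are hypothetical; only `from vllm import SamplingParams` comes from the diff. The vllm branch is not exercised here, so the snippet runs in an environment without vllm.

```python
def make_sampling_config(backend: str, max_new_tokens: int = 2048, temperature: float = 1.0):
    """Build sampling settings, importing vllm only when that backend is used."""
    if backend == "vllm":
        # Lazy import: an sglang-only environment never reaches this line,
        # so a missing vllm install cannot break module import.
        from vllm import SamplingParams

        return SamplingParams(temperature=temperature, max_tokens=max_new_tokens)
    if backend == "sglang":
        # Plain dict as a stand-in for the sglang-side settings.
        return {"temperature": temperature, "max_new_tokens": max_new_tokens}
    raise ValueError(f"unknown backend: {backend!r}")


print(make_sampling_config("sglang"))
```

The cost of this pattern is that a missing dependency surfaces at call time rather than at import time, which matches the behavior wanted for optional backends.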

- README: fix "reward engine" -> "SGLang" in engine description (FSDP and SGLang are now consistent concepts on the same level)
- README: drop redundant "template / no hardcoded paths" sentence; the /path/to/ placeholders in the code block already make this obvious
- train_colocate.py: remove verbose banner comment block above init_model_context; the short "# configure model" line is sufficient

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Brings in:
- examples/orm_rl_demo: a new general ORM RL example on Geo3K (PR opendilab#56)
- minor stat-aggregation changes in spmd_ppo_trainer / ppo_trainer_vl
- strategy_base reward-model shard_size = world_size

Conflict resolution (3 files):
- lightrft/strategy/strategy_base.py: kept HEAD's _resolve_fsdp_shard_size helper (more robust for non-divisible world sizes).
- lightrft/trainer/ppo_trainer_vl.py: kept HEAD's generic reward_metric_values = defaultdict(list) collection (subsumes the named-list approach upstream introduced, and is required for the PRM variant 2 diagnostics).
- lightrft/trainer/spmd_ppo_trainer.py: kept HEAD's compact rollout-stats aggregation; upstream's elaborate named-block was redundant with the later "Detailed Step Statistics" section HEAD already has.
- lightrft/trainer/fast_exp_maker.py: removed duplicate references= kwarg auto-merge fluke.

Verification:
- python3 ast.parse on all 4 conflict files → clean
- python3 -m unittest examples.math_prm.tests.test_ursa_variant2 -v → Ran 9 tests in 0.050s, OK

…sults

Resolves all 9 C-severity and 5 I-severity findings from opendilab#53 (comment)

C - Critical (blocking) - fixed:
- Delete 7 debug-only smoke / fix-verify scripts that hardcoded /home/ubuntu, /mnt/.../puyuan, and /home/ubuntu/miniconda3/.../torchrun:
  run_grpo_smoke_misalign_fix.sh
  run_smoke_base_eval_only.sh
  run_smoke_eval_fix_verify.sh
  run_smoke_padding_fix_verify.sh
  run_smoke_per_step_prm.sh
  run_smoke_per_step_prm_groupnorm.sh
  run_smoke_paper_variant2.sh
  Only the two production launchers ship now (PS-GRPO + variant 2).
- tools/prepare_ursa_stage3_manifest.py: drop /home/ubuntu/... defaults for --input-path / --image-root; both are now required=True so a fresh user gets a clear missing-arg error instead of silently targeting someone else's home directory.
- run_grpo_math_prm_ursa_8b_variant2.sh + run_grpo_math_prm_ursa_8b.sh: add `set -eo pipefail` at the top so a crashed torchrun propagates its exit code through the `2>&1 | tee` pipeline (previously, tee's success masked torchrun crashes and orchestrators saw a green run).

I - Important (blocking) - fixed:
- examples/math_prm/assets/exp_20260603/{eval_outcome,kl_and_rollout,eval_quality,variant2_health}.png: 4 W&B-derived figures from the 9-day production run, matching the orm_rl_demo/assets/exp_*/ pattern established in PR opendilab#56.
- README.md / README_zh.md: add §7 "Results - 9-day production run" section with eval-outcome table, W&B run link, and the 4 figures.
- README.md / README_zh.md: add §6 "Strict Paper Eq.9 - variant 2 path" section (formula, math_per_step_prm workflow, sed-relabel command, unit-test invocation); previously the variant-2 launcher shipped without any README coverage.
- README.md / README_zh.md: update §8 files tree to match git ls-files (adds ursa_variant2.py, test_ursa_variant2.py, assets/; removes the 7 deleted smoke scripts).
- README.md / README_zh.md: also add Available labels row for math_per_step_prm; expand the "What's Logged" section to include the 13 PRM diagnostic fields + the 7 variant-2 ursa_v2_* fields.
- run_grpo_math_prm_ursa_8b.sh:236: switch from `> "${TRAIN_LOG}" 2>&1` to `2>&1 | tee "${TRAIN_LOG}"` so tmux operators see live training output (matches orm_rl_demo / r1_aqa / gsm8k_geo3k launcher convention).
- test_ursa_variant2.py: move from examples/math_prm/tests/ to examples/math_prm/ top level to match every other example in the repo. Path-resolution fixed accordingly.

M - Minor (non-blocking) - also addressed:
- run_grpo_math_prm_ursa_8b_variant2.sh header docstring rewritten to describe variant 2 (was previously a verbatim PS-GRPO copy/paste).
- train_colocate.py:28 docstring "usage: python train_grpo_rm_colocate.py" corrected to "python examples/math_prm/train_colocate.py".

Verification:
$ python3 -m unittest examples.math_prm.test_ursa_variant2 -v
Ran 9 tests in 0.034s - OK

HansBug self-assigned this Apr 9, 2026

HansBug added the enhancement New feature or request label Apr 9, 2026

HansBug requested a review from puyuan1996 April 9, 2026 06:14

HansBug added the documentation Improvements or additions to documentation label Apr 9, 2026

dev(hansbug): add math_prm from cluster

fd2588b

HansBug force-pushed the dev/st branch from 602998f to fd2588b Compare April 9, 2026 06:18

HansBug added 2 commits April 9, 2026 14:26

chore(safework): sync runnable example from cluster

3a3066c

merge: sync upstream main into dev/st

dc44de7

HansBug changed the title ~~feature(safework): migrate svkng pipeline from cluster into runnable example~~ feature(orm_rl_demo): narrow scope to a minimal general ORM RL demo Apr 9, 2026

HansBug changed the title ~~feature(orm_rl_demo): narrow scope to a minimal general ORM RL demo~~ feature(safework): migrate svkng pipeline from cluster into runnable example Apr 9, 2026

HansBug changed the title ~~feature(safework): migrate svkng pipeline from cluster into runnable example~~ feature(orm_rl_demo): add a minimal general ORM RL example on Geo3K Apr 9, 2026

HansBug added 5 commits April 9, 2026 15:12

refactor(orm_rl_demo): rename safework example to orm_rl_demo

076be66

refactor(orm_rl_demo): narrow demo to one Geo3K general ORM entry

e14a64b

fix orm rl demo 2gpu bringup

2766906

fix orm rl demo rlaunch bringup

218b89f

fix orm rl demo trajectory analysis arg

b5c119e

puyuan1996 requested changes Apr 14, 2026

View reviewed changes

puyuan1996 changed the title ~~feature(orm_rl_demo): add a minimal general ORM RL example on Geo3K~~ feature(zsa): add a minimal general ORM RL example on Geo3K Apr 14, 2026

HansBug added 2 commits April 15, 2026 10:12

fix orm rl demo reward engine bringup

0e1efe9

address orm rl demo pr review feedback

5bc2f3c

puyuan1996 requested changes Apr 15, 2026

View reviewed changes

puyuan1996 marked this pull request as ready for review April 15, 2026 02:56

HansBug added 5 commits April 15, 2026 18:30

clarify general reward metric names

aa149c6

Fix ORM general RM engine prompts

f9bb867

merge: sync upstream main into dev/st

fd09884

fix: address orm rl demo review feedback

c290079

style: fix trainer yapf formatting

a8d338c

docs(orm_rl_demo): add full-run validation record

8919ece

HansBug added 2 commits April 18, 2026 11:28

docs(orm_rl_demo): store experiment figures in repo

3a9284b

fix(orm_rl_demo): default demo script to sglang

b5a6a16

HansBug requested a review from puyuan1996 April 18, 2026 03:48

puyuan1996 requested changes Apr 24, 2026

View reviewed changes

HansBug and others added 2 commits April 24, 2026 19:33

AltmanD reviewed Apr 29, 2026

View reviewed changes

puyuan1996 merged commit 16db1b0 into opendilab:main Apr 29, 2026
1 check passed

HansBug mentioned this pull request Jun 3, 2026

feature(zsh): migrate URSA-MATH stage3 training to LightRFT #53

Open

80 tasks
