refactor(sunjx): refactor dataset and reward module by Jiaxuan-Sun · Pull Request #13 · opendilab/LightRFT

Jiaxuan-Sun · 2025-12-31T07:11:42Z

1. Dataset Module Refactoring (`lightrft/datasets/`)

Modified:

__init__.py: Refactored imports with unified interfaces and improved optional dependency handling

Added:

config.py: DatasetConfig class
- Unified configuration for train/eval/pretrain datasets
- Auto-normalization of data_path and data_probs (supports string/list)
- Factory methods: for_train(), for_eval(), for_pretrain()
- Parameter validation
loader.py: DatasetLoader class
- Unified loading interface for train/eval/pretrain datasets
- Automatic handling of blending_datasets parameters
- Support for PromptDatasetVL and SFTDatasetVL
- Consistent logging
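
The normalization and factory behavior described for `DatasetConfig` can be illustrated with a minimal sketch. This is not the code in `config.py`: the class name `DatasetConfigSketch`, the `role` field, the comma-separated string format, and the equal-weight default for `data_probs` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Union


def _as_list(value, cast):
    """Normalize None, a comma-separated string, a scalar, or a sequence into a list."""
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: a string holds comma-separated entries.
        return [cast(part.strip()) for part in value.split(",") if part.strip()]
    if isinstance(value, (list, tuple)):
        return [cast(v) for v in value]
    return [cast(value)]


@dataclass
class DatasetConfigSketch:
    """Hypothetical stand-in for DatasetConfig; field names are assumed."""
    data_path: Union[str, Sequence[str]]
    data_probs: Union[str, float, Sequence[float], None] = None
    role: str = "train"  # assumed values: "train", "eval", "pretrain"
    max_samples: Optional[int] = None

    def __post_init__(self):
        # Auto-normalization: both fields end up as plain lists.
        self.data_path = _as_list(self.data_path, str)
        self.data_probs = _as_list(self.data_probs, float)
        # Parameter validation.
        if not self.data_path:
            raise ValueError("data_path must name at least one dataset")
        if not self.data_probs:
            # Assumption: equal weighting when no probabilities are given.
            self.data_probs = [1.0 / len(self.data_path)] * len(self.data_path)
        if len(self.data_probs) != len(self.data_path):
            raise ValueError("data_probs and data_path must have the same length")
        if any(p < 0 for p in self.data_probs):
            raise ValueError("data_probs must be non-negative")
        if self.max_samples is not None and self.max_samples <= 0:
            raise ValueError("max_samples must be positive")

    @classmethod
    def for_train(cls, data_path, data_probs=None, **kwargs):
        return cls(data_path=data_path, data_probs=data_probs, role="train", **kwargs)

    @classmethod
    def for_eval(cls, data_path, data_probs=None, **kwargs):
        return cls(data_path=data_path, data_probs=data_probs, role="eval", **kwargs)

    @classmethod
    def for_pretrain(cls, data_path, data_probs=None, **kwargs):
        return cls(data_path=data_path, data_probs=data_probs, role="pretrain", **kwargs)
```

With this sketch, `DatasetConfigSketch.for_train("a.jsonl,b.jsonl", "0.7,0.3")` and `DatasetConfigSketch.for_train(["a.jsonl", "b.jsonl"], [0.7, 0.3])` yield the same normalized fields.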

2. Reward Module (`lightrft/reward/`)

Added:

__init__.py: Module entry point with unified exports
base.py: BaseReward abstract base class
- Unified compute() method signature
- Consistent return format: (rewards, metrics)
rule.py: RuleReward class
- Rule-based reward implementation
- Format checking (e.g., <think> tags, \boxed{} notation)
- Accuracy verification using mathruler grader
- Registry pattern for custom rule types
- Built-in rules: default, geo3k_*, gsm8k_*
model.py: Reward model implementations
- SingleRewardModel: Single reward model wrapper with auto load/offload
- MultiRewardModel: Multiple reward model ensemble with recipe-based aggregation
- Supports standard PyTorch models and custom engines (e.g., SGLang)
manager.py: RewardManager class
- Unified manager for all reward types
- Auto-selection of reward implementation (rule/single/multi)
- from_config() factory method
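
The `compute()` contract and the rule registry listed above might look roughly like the following sketch. It is an illustration under stated assumptions, not the code in `base.py` or `rule.py`: the names `BaseRewardSketch`, `RuleRewardSketch`, `register_rule`, and `RULE_REGISTRY` are hypothetical, rewards are plain Python lists rather than tensors, the 0.5/0.5 weighting is invented, and exact string matching stands in for the mathruler grader.

```python
import re
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Sequence, Tuple

RuleFn = Callable[[str, str], float]
RULE_REGISTRY: Dict[str, RuleFn] = {}  # hypothetical registry of rule types


def register_rule(name: str):
    """Decorator that registers a rule function under a name."""
    def decorator(fn: RuleFn) -> RuleFn:
        RULE_REGISTRY[name] = fn
        return fn
    return decorator


_THINK = re.compile(r"<think>.*?</think>", re.DOTALL)
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


@register_rule("default")
def default_rule(response: str, reference: str) -> float:
    # Format check: a <think> block and a \boxed{} answer must both be present.
    boxed = _BOXED.findall(response)
    format_ok = bool(_THINK.search(response)) and bool(boxed)
    # Accuracy check: exact match here; the PR uses the mathruler grader instead.
    correct = bool(boxed) and boxed[-1].strip() == reference.strip()
    # Assumption: equal weighting of format and accuracy.
    return 0.5 * float(format_ok) + 0.5 * float(correct)


class BaseRewardSketch(ABC):
    """Every reward type returns the same (rewards, metrics) pair."""

    @abstractmethod
    def compute(
        self, responses: Sequence[str], references: Sequence[str]
    ) -> Tuple[List[float], Dict[str, float]]:
        ...


class RuleRewardSketch(BaseRewardSketch):
    def __init__(self, rule: str = "default"):
        if rule not in RULE_REGISTRY:
            raise KeyError(f"unknown rule type: {rule}")
        self._rule = RULE_REGISTRY[rule]

    def compute(self, responses, references):
        if len(responses) != len(references):
            raise ValueError("responses and references must have the same length")
        rewards = [self._rule(r, ref) for r, ref in zip(responses, references)]
        mean = sum(rewards) / len(rewards) if rewards else 0.0
        return rewards, {"reward_mean": mean}
```

A custom rule would be added by decorating another function with `@register_rule("my_rule")` and passing that name to the constructor.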

…nt 2 URSA paper variant 2 figure-ablation describes per-step PRM reward but does not specify how the resulting N step_scores integrate with the GRPO baseline-subtraction convention. Two interpretations:

- raw: scatter raw sigmoid step_score directly (paper figure ablation). PRM gives all valid steps a positive value (typically 0.6-0.95), so token-level returns are nearly always positive, which gives a weak PG signal (smoke pg ~ 6e-5, ~10^3 weaker than PSGRPO baseline). Reproduces paper's "variant 2 underperforms" observation.
- group_norm: for each step k, subtract group mean and divide by group std across the K trajectories sharing the same prompt BEFORE scattering. Matches GRPO convention: zero-mean signed advantages with magnitude ~1, restoring directional PG signal.

Implementation:

- examples/math_prm/train_colocate.py: --per_step_reward_mode CLI flag, default "raw" (preserves prior behavior).
- lightrft/trainer/fast_exp_maker.py::_apply_step_reward_group_norm: cross-experience step-level normalization at the entry of _compute_advantages_and_returns. Reshapes per-traj step_rewards to (G, K, max_steps), computes masked mean/std along K dim (padding from step_token_indices < 0 is masked), writes back into experience.info. Downstream compute_reward scatter logic unchanged.
- examples/math_prm/run_smoke_per_step_prm_groupnorm.sh: smoke variant exercising --per_step_reward_mode group_norm.

Validation:

- Math sanity (4 trajectories, geometry holdout opendilab#13): per-step group-normalized values have mean=0, std=1 across all 5 steps.
- Smoke v6 end-to-end (K=2 + 1 PPO step + 500-sample eval): alignment_failed=0.00%, outcome=0.5952, n_aligned_steps=4.97, so no regression vs raw mode (small K can't yet show group_norm benefit; effective ablation requires K=8 + multi-step training).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
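
The group_norm interpretation can be sketched for a single prompt group in plain Python. This is not `_apply_step_reward_group_norm`: the function name, the use of `None` for padded steps, the population standard deviation, and the `eps` value are assumptions, and the real code operates on tensors of shape (G, K, max_steps).

```python
import math
from typing import List, Optional


def group_norm_step_rewards(
    step_rewards: List[List[Optional[float]]], eps: float = 1e-6
) -> List[List[Optional[float]]]:
    """Normalize PRM step scores across the K trajectories of one prompt.

    step_rewards[i][k] is the score of step k in trajectory i; None marks
    a padded step and is excluded from the statistics (the mask).
    """
    max_steps = max((len(traj) for traj in step_rewards), default=0)
    out: List[List[Optional[float]]] = [[None] * len(traj) for traj in step_rewards]
    for k in range(max_steps):
        # Collect only the trajectories that actually have a valid step k.
        valid = [
            (i, traj[k])
            for i, traj in enumerate(step_rewards)
            if k < len(traj) and traj[k] is not None
        ]
        if not valid:
            continue
        values = [v for _, v in valid]
        mean = sum(values) / len(values)
        # Assumption: population std; a lone valid step normalizes to 0.
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        for i, v in valid:
            out[i][k] = (v - mean) / (std + eps)
    return out
```

For two trajectories with raw scores `[0.9, 0.8]` and `[0.7, 0.6, 0.95]`, the first two steps become signed values of opposite sign and near-unit magnitude, while the unmatched third step becomes 0, so all-positive raw scores turn into directional advantages.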

Jiaxuan-Sun added 2 commits December 31, 2025 14:58

refactor(sunjx): refactor dataset and reward module

773a1ee

Remove unnecessary code

513789d

puyuan1996 added the refactor label (Cleanup, formatting, or restructuring of existing code.) Jan 4, 2026

puyuan1996 requested changes Jan 4, 2026

Comment thread lightrft/reward/model.py Outdated

Comment thread lightrft/datasets/__init__.py

Comment thread lightrft/datasets/__init__.py Outdated

puyuan1996 mentioned this pull request Jan 5, 2026

Roadmap for LightRFT v0.1.1 #19

Closed

Jiaxuan-Sun added 4 commits January 7, 2026 15:18

refactor(sunjx): update geo3k to use refactored dataset and reward APIs

875600d

refoctor(sunjx): format code

43c4cba

feature(sunjx): Resolve merge conflicts with opendilab/main

35b9277

Merge branch 'main' into refactor/dataset-reward-module

6989963

puyuan1996 mentioned this pull request Jan 23, 2026

Roadmap for LightRFT v0.1.2 #28

Open

1 task

HansBug mentioned this pull request May 8, 2026

feature(zsh): migrate URSA-MATH stage3 training to LightRFT #53

Open

80 tasks

Jiaxuan-Sun wants to merge 6 commits into opendilab:main from Jiaxuan-Sun:refactor/dataset-reward-module


2 participants
