refactor(sunjx): refactor dataset and reward module #13

Open
Jiaxuan-Sun wants to merge 6 commits into
opendilab:mainfrom
Jiaxuan-Sun:refactor/dataset-reward-module

@Jiaxuan-Sun

Contributor

1. Dataset Module Refactoring (lightrft/datasets/)

Modified:

  • __init__.py: Refactored imports with unified interfaces and improved optional dependency handling

Added:

  • config.py: DatasetConfig class

    • Unified configuration for train/eval/pretrain datasets
    • Auto-normalization of data_path and data_probs (supports string/list)
    • Factory methods: for_train(), for_eval(), for_pretrain()
    • Parameter validation
  • loader.py: DatasetLoader class

    • Unified loading interface for train/eval/pretrain datasets
    • Automatic handling of blending_datasets parameters
    • Support for PromptDatasetVL and SFTDatasetVL
    • Consistent logging
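
The normalization and factory behavior described for config.py could look roughly like the sketch below. It is not the PR's code: the class name `DatasetConfigSketch`, the `split` field, the comma-separated string format, the uniform default for `data_probs`, and the factory signatures are all assumptions made for illustration. Only the names `data_path`, `data_probs`, `for_train()`, `for_eval()`, and `for_pretrain()` come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Union


def _to_list(value: Union[str, Sequence], cast: Callable) -> List:
    """Accept 'a,b' or ['a', 'b'] and return a list of cast, stripped items."""
    items = value.split(",") if isinstance(value, str) else list(value)
    return [cast(str(item).strip()) for item in items if str(item).strip()]


@dataclass
class DatasetConfigSketch:
    """Hypothetical stand-in for DatasetConfig; field names beyond
    data_path and data_probs are assumptions."""
    data_path: Union[str, Sequence[str]]
    data_probs: Optional[Union[str, Sequence[float]]] = None
    split: str = "train"

    def __post_init__(self) -> None:
        # Normalize string or list input into plain lists.
        self.data_path = _to_list(self.data_path, str)
        if not self.data_path:
            raise ValueError("data_path must name at least one dataset")
        if self.data_probs is None:
            # Assumption: fall back to uniform sampling probabilities.
            n = len(self.data_path)
            self.data_probs = [1.0 / n] * n
        else:
            self.data_probs = _to_list(self.data_probs, float)
        # Parameter validation.
        if len(self.data_probs) != len(self.data_path):
            raise ValueError("data_probs and data_path must have equal length")
        if any(p < 0 for p in self.data_probs) or sum(self.data_probs) <= 0:
            raise ValueError("data_probs must be non-negative and not all zero")
        if self.split not in ("train", "eval", "pretrain"):
            raise ValueError(f"unknown split: {self.split!r}")

    @classmethod
    def for_train(cls, data_path, data_probs=None) -> "DatasetConfigSketch":
        return cls(data_path, data_probs, split="train")

    @classmethod
    def for_eval(cls, data_path, data_probs=None) -> "DatasetConfigSketch":
        return cls(data_path, data_probs, split="eval")

    @classmethod
    def for_pretrain(cls, data_path, data_probs=None) -> "DatasetConfigSketch":
        return cls(data_path, data_probs, split="pretrain")
```

With this sketch, `DatasetConfigSketch.for_train("a.jsonl, b.jsonl", "0.7,0.3")` and `DatasetConfigSketch.for_train(["a.jsonl", "b.jsonl"], [0.7, 0.3])` end up with the same normalized fields, which is the point of accepting both string and list input.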

2. Reward Module (lightrft/reward/)

Added:

  • __init__.py: Module entry point with unified exports

  • base.py: BaseReward abstract base class

    • Unified compute() method signature
    • Consistent return format: (rewards, metrics)
  • rule.py: RuleReward class

    • Rule-based reward implementation
    • Format checking (e.g., <think> tags, \boxed{} notation)
    • Accuracy verification using mathruler grader
    • Registry pattern for custom rule types
    • Built-in rules: default, geo3k_*, gsm8k_*
  • model.py: Reward model implementations

    • SingleRewardModel: Single reward model wrapper with auto load/offload
    • MultiRewardModel: Multiple reward model ensemble with recipe-based aggregation
    • Supports standard PyTorch models and custom engines (e.g., SGLang)
  • manager.py: RewardManager class

    • Unified manager for all reward types
    • Auto-selection of reward implementation (rule/single/multi)
    • from_config() factory method
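
To make the `compute()` contract and the registry pattern concrete, here is a minimal sketch. It is not taken from the PR: `BaseRewardSketch`, `RuleRewardSketch`, `register_rule`, the rule name `boxed_exact_demo`, the `compute()` argument names, and the 0.5/0.5 weighting are hypothetical, and an exact string comparison stands in for the mathruler grader. Only the `(rewards, metrics)` return shape and the `<think>` / `\boxed{}` format checks follow the description above.

```python
import re
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Tuple

# Hypothetical registry mapping a rule name to a scoring function.
RULE_REGISTRY: Dict[str, Callable[[str, str], float]] = {}


def register_rule(name: str):
    """Decorator that adds a scoring function to the registry under `name`."""
    def decorator(fn: Callable[[str, str], float]):
        RULE_REGISTRY[name] = fn
        return fn
    return decorator


class BaseRewardSketch(ABC):
    """Assumed shape of the base class: one compute() returning (rewards, metrics)."""

    @abstractmethod
    def compute(
        self, responses: List[str], references: List[str]
    ) -> Tuple[List[float], Dict[str, float]]:
        ...


@register_rule("boxed_exact_demo")
def boxed_exact_demo(response: str, reference: str) -> float:
    # Format check: <think> tags present and a \boxed{...} answer (no nested braces).
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    format_ok = "<think>" in response and "</think>" in response and boxed is not None
    # Accuracy check: exact string match, a stand-in for a real math grader.
    correct = boxed is not None and boxed.group(1).strip() == reference.strip()
    # Assumption: equal weight for format and accuracy.
    return 0.5 * float(format_ok) + 0.5 * float(correct)


class RuleRewardSketch(BaseRewardSketch):
    def __init__(self, rule: str) -> None:
        if rule not in RULE_REGISTRY:
            raise KeyError(f"unknown rule: {rule!r}")
        self._fn = RULE_REGISTRY[rule]

    def compute(self, responses, references):
        rewards = [self._fn(r, ref) for r, ref in zip(responses, references)]
        mean = sum(rewards) / len(rewards) if rewards else 0.0
        return rewards, {"reward_mean": mean}
```

A custom rule is added by decorating another function with `@register_rule("my_rule")` and constructing `RuleRewardSketch("my_rule")`; callers only ever see the `(rewards, metrics)` tuple.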

@puyuan1996 added the refactor label (Cleanup, formatting, or restructuring of existing code.) on Jan 4, 2026
Comment thread lightrft/reward/model.py Outdated
Comment thread lightrft/datasets/__init__.py
Comment thread lightrft/datasets/__init__.py Outdated
@puyuan1996 mentioned this pull request on Jan 23, 2026
1 task
HansBug added a commit to HansBug/LightRFT that referenced this pull request May 8, 2026
…nt 2

URSA paper variant 2 figure-ablation describes per-step PRM reward but
does not specify how the resulting N step_scores integrate with the GRPO
baseline-subtraction convention. Two interpretations:

  raw         : scatter raw sigmoid step_score directly (paper figure
                ablation). PRM gives all valid steps a positive value
                (typically 0.6-0.95) so token-level returns are nearly
                always positive — weak PG signal (smoke pg ~ 6e-5,
                ~10^3 weaker than PSGRPO baseline). Reproduces paper's
                "variant 2 underperforms" observation.

  group_norm  : for each step k, subtract group mean and divide by group
                std across the K trajectories sharing the same prompt
                BEFORE scattering. Matches GRPO convention: zero-mean
                signed advantages with magnitude ~1, restoring directional
                PG signal.
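
One way to write the group_norm interpretation down, as a sketch rather than the exact formula used in the code: let s_{i,k} be the PRM score of step k in trajectory i, and V_k the set of trajectories in the prompt's group that actually have a step k. The symbols s, V_k, and the stabilizer ε are notation introduced here, and the use of the population (not sample) standard deviation is an assumption.

```latex
\mu_k = \frac{1}{|V_k|} \sum_{i \in V_k} s_{i,k}, \qquad
\sigma_k = \sqrt{\frac{1}{|V_k|} \sum_{i \in V_k} \left(s_{i,k} - \mu_k\right)^2}, \qquad
\hat{s}_{i,k} = \frac{s_{i,k} - \mu_k}{\sigma_k + \epsilon}
\quad \text{for } i \in V_k
```

Under raw mode the scattered value is s_{i,k} itself; under group_norm it is \hat{s}_{i,k}, which is zero-mean over V_k by construction.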

Implementation:
- examples/math_prm/train_colocate.py: --per_step_reward_mode CLI flag,
  default "raw" (preserves prior behavior).
- lightrft/trainer/fast_exp_maker.py::_apply_step_reward_group_norm:
  cross-experience step-level normalization at the entry of
  _compute_advantages_and_returns. Reshapes per-traj step_rewards to
  (G, K, max_steps), computes masked mean/std along K dim (padding from
  step_token_indices < 0 is masked), writes back into experience.info.
  Downstream compute_reward scatter logic unchanged.
- examples/math_prm/run_smoke_per_step_prm_groupnorm.sh: smoke variant
  exercising --per_step_reward_mode group_norm.
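
The masked per-step normalization can be sketched in plain Python as follows. This is an illustration, not the code in fast_exp_maker.py: the function names, the nested-list layout (consecutive blocks of `group_size` rows share a prompt), the `eps` value, the population standard deviation, and zeroing padded slots are all assumptions. Only the mode names "raw" and "group_norm" and the convention that a negative step_token_indices entry marks padding come from the message above.

```python
import math
from typing import List


def group_norm_step_rewards(
    step_rewards: List[List[float]],
    step_token_indices: List[List[int]],
    group_size: int,
    eps: float = 1e-6,
) -> List[List[float]]:
    """Normalize each step's score across the trajectories of one prompt group.

    step_rewards[t][k] is the score of step k in trajectory t. Rows are assumed
    to be ordered so that each consecutive block of `group_size` rows shares a
    prompt. step_token_indices[t][k] < 0 marks a padded (missing) step.
    """
    n_traj = len(step_rewards)
    if n_traj % group_size != 0:
        raise ValueError("number of trajectories must be a multiple of group_size")
    max_steps = len(step_rewards[0]) if n_traj else 0
    # Assumption: padded slots are written back as 0.0.
    out = [[0.0] * max_steps for _ in range(n_traj)]
    for start in range(0, n_traj, group_size):
        rows = range(start, start + group_size)
        for k in range(max_steps):
            valid = [t for t in rows if step_token_indices[t][k] >= 0]
            if not valid:
                continue
            vals = [step_rewards[t][k] for t in valid]
            mean = sum(vals) / len(vals)
            # Population std over the valid entries only (an assumption).
            std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
            for t in valid:
                out[t][k] = (step_rewards[t][k] - mean) / (std + eps)
    return out


def apply_step_reward_mode(step_rewards, step_token_indices, group_size, mode="raw"):
    """Dispatch on the mode string; "raw" returns an unchanged copy."""
    if mode == "raw":
        return [list(row) for row in step_rewards]
    if mode == "group_norm":
        return group_norm_step_rewards(step_rewards, step_token_indices, group_size)
    raise ValueError(f"unknown per-step reward mode: {mode!r}")
```

Note that a step with a single valid trajectory normalizes to 0 here, since its deviation from the group mean is zero; with a group of only two trajectories every shared step collapses to roughly +1 and -1, which is consistent with the remark that small groups cannot show much benefit.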

Validation:
- Math sanity (4 trajectories, geometry holdout opendilab#13): per-step
  group-normalized values have mean=0, std=1 across all 5 steps.
- Smoke v6 end-to-end (K=2 + 1 PPO step + 500-sample eval):
  alignment_failed=0.00%, outcome=0.5952, n_aligned_steps=4.97 — no
  regression vs raw mode (small K can't yet show group_norm benefit;
  effective ablation requires K=8 + multi-step training).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants