From 202acd8a478d7827496876be7e2cb047a50725f6 Mon Sep 17 00:00:00 2001
From: puyuan1996
Date: Tue, 3 Feb 2026 17:44:31 +0800
Subject: [PATCH] doc(pu): add init version of fast_exp_maker best practice

---
 .../source/best_practice/fast_exp_maker_en.md | 646 ++++++++++++++++++
 .../source/best_practice/fast_exp_maker_zh.md | 646 ++++++++++++++++++
 2 files changed, 1292 insertions(+)
 create mode 100644 docs/source/best_practice/fast_exp_maker_en.md
 create mode 100644 docs/source/best_practice/fast_exp_maker_zh.md

diff --git a/docs/source/best_practice/fast_exp_maker_en.md b/docs/source/best_practice/fast_exp_maker_en.md
new file mode 100644
index 00000000..2308e33c
--- /dev/null
+++ b/docs/source/best_practice/fast_exp_maker_en.md
@@ -0,0 +1,646 @@
+# FastExperienceMaker Best Practice Guide
+
+## Table of Contents
+- [Overview](#overview)
+- [Core Features](#core-features)
+- [Architecture Components](#architecture-components)
+- [Usage Guide](#usage-guide)
+- [Configuration Parameters](#configuration-parameters)
+- [Advantage Estimation Methods](#advantage-estimation-methods)
+- [Best Practices](#best-practices)
+- [Common Issues and Solutions](#common-issues-and-solutions)
+- [Performance Tuning](#performance-tuning)
+
+## Overview
+
+### What is FastExperienceMaker?
+
+FastExperienceMaker is an optimized experience generation engine for RLHF (Reinforcement Learning from Human Feedback) training in LightRFT. It extends the base `NaiveExperienceMaker` with high-performance inference backends (VLLM/SGLang) and advanced RL features.
+
+### Key Capabilities
+
+- **High-Performance Inference**: VLLM and SGLang backend support for efficient text generation
+- **Multimodal Support**: Vision-language model (VLM) data processing with images and videos
+- **Advanced Advantage Estimation**: Multiple methods including GAE, RLOO, REINFORCE, and Group Normalization
+- **Flexible Reward Composition**: Support for multiple reward models with custom aggregation functions
+- **Sample Packing**: Improved training efficiency through sequence packing
+- **Reward Normalization**: Running reward statistics with normalization and clipping
+
+## Core Features
+
+### 1. Experience Generation Pipeline
+
+The FastExperienceMaker implements a 7-stage pipeline for experience generation:
+
+```
+Stage 1: Sample Generation (VLLM/SGLang)
+    ↓
+Stage 2: Shard-Parallel Preprocessing
+    ↓
+Stage 3: Model Inference (Actor, Critic, Initial, Reward Models)
+    ↓
+Stage 4: Shard-Parallel Postprocessing
+    ↓
+Stage 5: Reward Processing (Normalization, Shaping, Filtering)
+    ↓
+Stage 6: Multi-Image/Video Handling
+    ↓
+Stage 7: Advantage Computation
+```
+
+### 2. Multimodal Data Processing
+
+The `MultimodalDataProcessor` class handles mixed text-only and image-text data:
+
+- **Automatic Separation**: Separates text-only and multimodal samples
+- **Appropriate Processing**: Routes samples through tokenizer or multimodal processor
+- **Order Preservation**: Maintains original batch ordering after processing
+- **Multi-Image/Video Support**: Handles multiple images or videos per sample
+
+### 3. Reward Computation Engine
+
+The `RewardComputationEngine` manages reward model inference and aggregation:
+
+- **Remote Reward Models**: HTTP/gRPC reward model support
+- **Local Reward Models**: PyTorch-based reward models
+- **Custom Reward Functions**: Python functions for custom reward logic
+- **Multi-Model Ensemble**: Combine multiple reward models with custom aggregation
+- **Optimized Batching**: Efficient batch processing with optional sample filtering
+
+## Architecture Components
+
+### Class Hierarchy
+
+```
+NaiveExperienceMaker (Base Class)
+    ↓
+FastExperienceMaker
+    ├── MultimodalDataProcessor
+    ├── RewardComputationEngine
+    └── AdvantageCalculator (GAE/RLOO/REINFORCE/GroupNorm)
+```
+
+### Key Classes
+
+#### 1. FastExperienceMaker
+
+**Purpose**: Main experience generation class with optimized inference and advanced RL features.
+
+**Initialization Parameters**:
+- `packing_samples` (bool): Enable sample packing for efficiency
+- `processor`: Multimodal processor for VLM models
+- Other parameters inherited from `NaiveExperienceMaker`
+
+**Key Methods**:
+- `make_experience_list()`: Generate experiences from prompts
+- `generate_samples()`: Generate samples using inference engine
+- `get_advantages_and_returns()`: Compute advantages and returns
+
+#### 2. MultimodalDataProcessor
+
+**Purpose**: Handles preprocessing of mixed text-only and multimodal data.
+
+**Key Responsibilities**:
+- Normalize image/video inputs (file paths, PIL images, bytes)
+- Separate text-only and multimodal samples
+- Process through appropriate pipelines
+- Expand samples by `n_samples_per_prompt` factor
+
+#### 3. RewardComputationEngine
+
+**Purpose**: Manages reward model inference and score aggregation.
+
+**Processing Pipeline**:
+1. **Gather**: Collect or filter samples based on reward recipe
+2. **Process**: Run forward pass through reward model(s)
+3. **Aggregate**: Combine scores using reward_fn
+
+## Usage Guide
+
+### Basic Usage
+
+#### Text-Only Generation
+
+```python
+from lightrft.trainer.fast_exp_maker import FastExperienceMaker
+
+# Initialize experience maker
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=reward_model,
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    prompt_max_len=512,
+    kl_controller=kl_controller,
+    strategy=strategy,
+    packing_samples=False,
+)
+
+# Generate experiences
+prompts = ["Explain quantum computing", "What is machine learning?"]
+experiences = exp_maker.make_experience_list(
+    all_prompts=prompts,
+    temperature=0.7,
+    max_new_tokens=512,
+    top_p=0.9,
+)
+```
+
+#### Vision-Language Generation
+
+```python
+from PIL import Image
+
+# Initialize with processor for VLM support
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=reward_model,
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    processor=multimodal_processor,  # Required for VLM
+    prompt_max_len=512,
+    kl_controller=kl_controller,
+    strategy=strategy,
+)
+
+# Prepare multimodal data
+prompts = ["Describe this image", "What's in this picture?"]
+images = [
+    [Image.open("image1.jpg")],  # Single image
+    [Image.open("img2.jpg"), Image.open("img3.jpg")],  # Multiple images
+]
+references = ["A cat on a sofa", "Two dogs playing"]
+
+# Generate experiences
+experiences = exp_maker.make_experience_list(
+    all_prompts=prompts,
+    all_images=images,
+    all_references=references,
+    temperature=0.7,
+    max_new_tokens=512,
+)
+```
+
+### Advanced Usage
+
+#### Multiple Reward Models with Custom Aggregation
+
+```python
+import torch
+
+# Define custom reward aggregation function
+def custom_reward_fn(model_reward_list, labels, queries, refs, label_map):
+    """
+    Custom reward aggregation function.
+
+    Args:
+        model_reward_list: List of reward tensors from each model
+        labels: Sample labels
+        queries: Generated text
+        refs: Reference texts
+        label_map: Mapping from reward model names to indices
+
+    Returns:
+        aggregated_rewards: Combined reward tensor
+        reward_metrics: Dictionary of detailed metrics
+    """
+    # Example: fixed weighted average of two reward models (labels are not used here)
+    weights = torch.tensor([0.6, 0.4])  # Weights for two models
+    aggregated = sum(w * r for w, r in zip(weights, model_reward_list))
+
+    metrics = {
+        "reward_model_1": model_reward_list[0].mean(),
+        "reward_model_2": model_reward_list[1].mean(),
+    }
+
+    return aggregated, metrics
+
+# Initialize with multiple reward models
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=[reward_model_1, reward_model_2],  # List of models
+    reward_fn=custom_reward_fn,
+    reward_fn_label_map={"rm1": 0, "rm2": 1},
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    strategy=strategy,
+)
+```
+
+#### Sample Packing for Efficiency
+
+```python
+# Enable sample packing
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=reward_model,
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    strategy=strategy,
+    packing_samples=True,  # Enable packing
+)
+
+# Packed format: | prompt1 response1 [EOS] | prompt2 response2 [EOS] | ...
+# Benefits:
+# - Reduced padding overhead
+# - Improved GPU utilization
+# - Faster training throughput
+```
+
+#### Remote Reward Models
+
+```python
+# Use remote reward models via HTTP/gRPC
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=None,  # No local reward model
+    remote_rm_url=[
+        "http://reward-server-1:8000/score",
+        "http://reward-server-2:8000/score",
+    ],
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    strategy=strategy,
+)
+```
+
+## Configuration Parameters
+
+### Generation Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `temperature` | float | 1.0 | Sampling temperature (higher = more random) |
+| `top_p` | float | 1.0 | Nucleus sampling threshold |
+| `top_k` | int | -1 | Top-k sampling (-1 = disabled) |
+| `max_new_tokens` | int | 1024 | Maximum number of tokens to generate |
+| `min_new_tokens` | int | 1 | Minimum number of tokens to generate |
+| `skip_special_tokens` | bool | False | Skip special tokens in output |
+
+### Reward Processing Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `reward_running_norm` | bool | False | Enable running reward normalization |
+| `reward_running_norm_minus_mean` | bool | False | Subtract mean in normalization |
+| `reward_clip` | float | 0 | Reward clipping threshold (0 = disabled) |
+| `overlong_buffer` | bool | False | Enable overlong sequence penalty |
+| `overlong_buffer_len` | int | 50 | Buffer length for overlong penalty |
+| `overlong_buffer_penalty_factor` | float | 1.0 | Penalty factor for overlong sequences |
+
+### Advantage Estimation Parameters
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `advantage_estimator` | str | "gae" | Method: "gae", "rloo", "reinforce", "group_norm" |
+| `advantages_norm` | bool | False | Enable advantage normalization (whitening) |
+| `advantage_clip` | float | 0 | Advantage clipping threshold (0 = disabled) |
+| `gamma` | float | 1.0 | Discount factor for returns |
+| `lambd` | float | 0.95 | GAE lambda parameter |
+
+## Advantage Estimation Methods
+
+### 1. GAE (Generalized Advantage Estimation)
+
+**When to Use**: Standard PPO training with critic model.
+
+**Advantages**:
+- Balances bias-variance tradeoff via lambda parameter
+- Smooth advantage estimates
+- Works well with value function
+
+**Configuration**:
+```python
+strategy.config.advantage_estimator = "gae"
+strategy.config.gamma = 1.0
+strategy.config.lambd = 0.95
+```
+
+### 2. RLOO (REINFORCE Leave-One-Out)
+
+**When to Use**: Training without critic model, multiple samples per prompt.
+
+**Advantages**:
+- No critic model required
+- Reduces variance through baseline subtraction
+- Efficient for multiple samples per prompt
+
+**Configuration**:
+```python
+strategy.config.advantage_estimator = "rloo"
+strategy.config.n_samples_per_prompt = 4  # Required > 1
+```
+
+### 3. REINFORCE with Baseline
+
+**When to Use**: Simple policy gradient with reward baseline.
+
+**Advantages**:
+- Simple and straightforward
+- Works with single sample per prompt
+- No critic model required
+
+**Configuration**:
+```python
+strategy.config.advantage_estimator = "reinforce"
+```
+
+### 4. Group Normalization (GRPO)
+
+**When to Use**: Group-based advantage normalization, multiple samples per prompt.
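As a rough illustration of what group-based normalization means, the sketch below standardizes each reward against the mean and standard deviation of its own prompt group. It is not taken from LightRFT: the function name `group_norm_advantages`, the `eps` term, the choice of population standard deviation, and the assumption that the samples for one prompt sit next to each other in a flat list are all made up for this example.

```python
from statistics import mean, pstdev
from typing import List


def group_norm_advantages(
    rewards: List[float],
    n_samples_per_prompt: int,
    eps: float = 1e-8,
) -> List[float]:
    """Standardize each reward within its prompt group (illustrative only).

    Assumes rewards are laid out as [p0_s0, p0_s1, ..., p1_s0, p1_s1, ...],
    i.e. all samples of one prompt are contiguous.
    """
    k = n_samples_per_prompt
    if k < 2:
        raise ValueError("group normalization needs n_samples_per_prompt > 1")
    if len(rewards) % k != 0:
        raise ValueError("len(rewards) must be a multiple of n_samples_per_prompt")

    advantages: List[float] = []
    for start in range(0, len(rewards), k):
        group = rewards[start:start + k]
        mu = mean(group)
        sigma = pstdev(group)  # population std; a real trainer may use another estimator
        # eps keeps the division finite when every sample in the group ties
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


# Two prompts, two samples each
advantages = group_norm_advantages([1.0, 3.0, 2.0, 2.0], n_samples_per_prompt=2)
```

A group whose samples all received the same reward yields zero advantages under this scheme, so such prompts contribute no learning signal in the sketch.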
+
+**Advantages**:
+- Normalizes advantages within each prompt group
+- Reduces variance across different prompts
+- Effective for diverse prompt distributions
+
+**Configuration**:
+```python
+strategy.config.advantage_estimator = "group_norm"
+strategy.config.n_samples_per_prompt = 4  # Required > 1
+```
+
+### Comparison Table
+
+| Method | Critic Required | Samples per Prompt | Variance | Bias | Complexity |
+|--------|----------------|-------------------|----------|------|------------|
+| GAE | Yes | Any | Low | Low | Medium |
+| RLOO | No | > 1 | Medium | Low | Low |
+| REINFORCE | No | Any | High | Low | Low |
+| Group Norm | No | > 1 | Medium | Medium | Low |
+
+## Best Practices
+
+### 1. Choosing Inference Backend
+
+**VLLM**:
+- ✅ Best for: Large-scale deployment, high throughput
+- ✅ Supports: PagedAttention, continuous batching
+- ⚠️ Note: Requires CUDA-compatible GPU
+
+**SGLang**:
+- ✅ Best for: Research, flexibility
+- ✅ Supports: Custom sampling, structured generation
+- ⚠️ Note: May have different performance characteristics
+
+### 2. Multimodal Data Handling
+
+**Image Normalization**:
+```python
+# Supported formats:
+# 1. PIL Image objects
+images = [[Image.open("img.jpg")]]
+
+# 2. File paths (will be loaded automatically)
+images = [["path/to/image.jpg"]]
+
+# 3. Mixed formats
+images = [[Image.open("img1.jpg"), "path/to/img2.jpg"]]
+```
+
+**Multi-Image Scenarios**:
+```python
+# Multiple images per sample
+images = [
+    [img1, img2, img3],  # Sample 1: 3 images
+    [img4],  # Sample 2: 1 image
+    None,  # Sample 3: No image (text-only)
+]
+```
+
+### 3. Reward Model Configuration
+
+**Single Reward Model**:
+```python
+exp_maker = FastExperienceMaker(
+    reward_model=single_rm,
+    # ...
+)
+```
+
+**Multiple Reward Models with Aggregation**:
+```python
+exp_maker = FastExperienceMaker(
+    reward_model=[rm1, rm2, rm3],
+    reward_fn=custom_aggregation_fn,
+    reward_fn_label_map={"quality": 0, "safety": 1, "helpfulness": 2},
+    # ...
+)
+```
+
+**Custom Reward Function**:
+```python
+def custom_reward(queries, prompts, labels):
+    """Custom reward computation logic"""
+    rewards = []
+    for query, prompt, label in zip(queries, prompts, labels):
+        # Your custom logic here
+        score = compute_custom_score(query, prompt, label)
+        rewards.append(score)
+    return rewards
+
+exp_maker = FastExperienceMaker(
+    custom_reward_func=custom_reward,
+    # ...
+)
+```
+
+### 4. Memory Optimization
+
+**Enable Sample Packing**:
+```python
+# Reduces padding overhead by 30-50%
+exp_maker = FastExperienceMaker(
+    packing_samples=True,
+    # ...
+)
+```
+
+**Adjust Micro Batch Size**:
+```python
+# Balance memory usage and throughput
+strategy.config.micro_rollout_batch_size = 8  # Adjust based on GPU memory
+```
+
+**Gradient Checkpointing**:
+```python
+# Enable for large models
+actor.gradient_checkpointing_enable()
+```
+
+### 5. Reward Normalization Strategy
+
+**Running Normalization** (Recommended for stable training):
+```python
+strategy.args.reward_running_norm = True
+strategy.args.reward_running_norm_minus_mean = True  # Subtract mean
+```
+
+**Reward Clipping** (Prevent outliers):
+```python
+strategy.config.reward_clip = 10.0  # Clip to [-10, 10]
+```
+
+**Advantage Normalization** (Stabilize policy updates):
+```python
+strategy.config.advantages_norm = True
+strategy.config.advantage_clip = 5.0  # Optional clipping
+```
+
+### 6. Handling Overlong Sequences
+
+```python
+# Penalize sequences that are too long
+strategy.config.overlong_buffer = True
+strategy.config.overlong_buffer_len = 50  # Buffer length
+strategy.config.overlong_buffer_penalty_factor = 1.0  # Penalty strength
+
+# Example: If max_new_tokens=512 and buffer_len=50
+# Expected length = 512 - 50 = 462
+# Sequences longer than 462 tokens receive penalty
+```
+
+## Common Issues and Solutions
+
+### Issue 1: Out of Memory (OOM)
+
+**Symptoms**: CUDA out of memory error during experience generation.
+
+**Solutions**:
+1. Enable sample packing: `packing_samples=True`
+2. Reduce micro batch size: `strategy.config.micro_rollout_batch_size = 4`
+3. Enable gradient checkpointing: `actor.gradient_checkpointing_enable()`
+4. Reduce max sequence length: `max_new_tokens=256`
+5. Use smaller model or quantization
+
+### Issue 2: Slow Generation Speed
+
+**Symptoms**: Experience generation takes too long.
+
+**Solutions**:
+1. Use VLLM backend: `strategy.args.engine_type = "vllm"`
+2. Increase batch size: `strategy.config.micro_rollout_batch_size = 16`
+3. Enable sample packing: `packing_samples=True`
+4. Check GPU utilization: Ensure GPU is fully utilized
+5. Reduce `max_new_tokens` if possible
+
+### Issue 3: Unstable Training
+
+**Symptoms**: Reward or loss fluctuates wildly during training.
+
+**Solutions**:
+1. Enable reward normalization:
+   ```python
+   strategy.args.reward_running_norm = True
+   strategy.args.reward_running_norm_minus_mean = True
+   ```
+2. Enable advantage normalization:
+   ```python
+   strategy.config.advantages_norm = True
+   ```
+3. Add reward clipping:
+   ```python
+   strategy.config.reward_clip = 10.0
+   ```
+4. Reduce learning rate
+5. Use GAE with appropriate lambda: `strategy.config.lambd = 0.95`
+
+### Issue 4: Image Token Mismatch (VLM)
+
+**Symptoms**: Warning message about token/patch mismatch during rollout.
+
+**Cause**: Number of image tokens doesn't match pixel value patches.
+
+**Solution**: This is automatically fixed by FastExperienceMaker. The warning is informational only. If it occurs frequently:
+1. Check image preprocessing pipeline
+2. Verify processor configuration
+3. Ensure consistent image format across samples
+
+### Issue 5: RLOO Requires Multiple Samples
+
+**Symptoms**: Error when using RLOO with `n_samples_per_prompt = 1`.
+
+**Cause**: RLOO requires multiple samples per prompt for baseline computation.
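To see why a single sample cannot work, consider a minimal leave-one-out baseline written in plain Python. This is a sketch under stated assumptions rather than LightRFT's own code: the function name `rloo_advantages` is hypothetical, and the rewards are assumed to be a flat list in which the samples for each prompt are contiguous. Each sample's baseline is the mean reward of the other `k - 1` samples for the same prompt, so `k = 1` would divide by zero.

```python
from typing import List


def rloo_advantages(rewards: List[float], n_samples_per_prompt: int) -> List[float]:
    """Leave-one-out advantages for a flat, prompt-grouped reward list (illustrative)."""
    k = n_samples_per_prompt
    if k < 2:
        # With one sample there are no "other" samples to average over.
        raise ValueError("RLOO needs n_samples_per_prompt > 1: the baseline divides by k - 1")
    if len(rewards) % k != 0:
        raise ValueError("len(rewards) must be a multiple of n_samples_per_prompt")

    advantages: List[float] = []
    for start in range(0, len(rewards), k):
        group = rewards[start:start + k]
        total = sum(group)
        for r in group:
            baseline = (total - r) / (k - 1)  # mean reward of the other k - 1 samples
            advantages.append(r - baseline)
    return advantages


# Two prompts, two samples each
advantages = rloo_advantages([1.0, 0.0, 0.0, 1.0], n_samples_per_prompt=2)
```

Raising an explicit error for `k < 2` mirrors the symptom described above; how the library itself reports the problem may differ.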
+
+**Solution**:
+```python
+# Set n_samples_per_prompt > 1
+strategy.config.n_samples_per_prompt = 4
+strategy.config.advantage_estimator = "rloo"
+```
+
+Or switch to another method:
+```python
+# Use GAE or REINFORCE instead
+strategy.config.advantage_estimator = "gae"
+```
+
+### Issue 6: Remote Reward Model Timeout
+
+**Symptoms**: Timeout errors when using remote reward models.
+
+**Solutions**:
+1. Check network connectivity to reward model server
+2. Increase timeout in remote_rm_fn configuration
+3. Reduce batch size to avoid long processing times
+4. Consider using local reward models for better performance
+5. Implement retry logic in custom reward function
+
+## Performance Tuning
+
+### Throughput Optimization
+
+**Recommended Configuration for Maximum Throughput**:
+```python
+strategy.args.engine_type = "vllm"
+strategy.config.micro_rollout_batch_size = 16  # Adjust based on GPU memory
+exp_maker = FastExperienceMaker(
+    packing_samples=True,
+    # ...
+)
+```
+
+**Expected Performance**:
+- VLLM backend: 2-5x faster than HuggingFace generate
+- Sample packing: 30-50% reduction in padding overhead
+- Batch processing: Linear scaling with batch size (up to GPU memory limit)
+
+### Memory Efficiency
+
+**Recommended Configuration for Memory-Constrained Environments**:
+```python
+strategy.config.micro_rollout_batch_size = 4
+strategy.config.max_new_tokens = 256
+exp_maker = FastExperienceMaker(
+    packing_samples=True,
+    # ...
+)
+actor.gradient_checkpointing_enable()
+```
+
+## References
+
+### Related Documentation
+- [Strategy Design Philosophy](strategy_design_philosophy.md)
+- [Model Design Document](model.md)
+- [Reward Model Best Practices](reward_model.md)
+
+### Code References
+- FastExperienceMaker: `lightrft/trainer/fast_exp_maker.py`
+- Base ExperienceMaker: `lightrft/trainer/experience_maker.py`
+- Advantage Calculators: `lightrft/trainer/advantage_calculator.py`
+- VLLM Utils: `lightrft/strategy/vllm_utils/`
+
+### Research Papers
+- **GAE**: "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (Schulman et al., 2016)
+- **PPO**: "Proximal Policy Optimization Algorithms" (Schulman et al., 2017)
+- **RLOO**: "Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs" (Ahmadian et al., 2024)
+
+---
+
+**Document Version**: 1.0
+**Last Updated**: 2026-02-03
+**Maintainer**: LightRFT Team
diff --git a/docs/source/best_practice/fast_exp_maker_zh.md b/docs/source/best_practice/fast_exp_maker_zh.md
new file mode 100644
index 00000000..e8afcb34
--- /dev/null
+++ b/docs/source/best_practice/fast_exp_maker_zh.md
@@ -0,0 +1,646 @@
+# FastExperienceMaker 最佳实践指南
+
+## 目录
+- [概述](#概述)
+- [核心功能](#核心功能)
+- [架构组件](#架构组件)
+- [使用指南](#使用指南)
+- [配置参数](#配置参数)
+- [优势估计方法](#优势估计方法)
+- [最佳实践](#最佳实践)
+- [常见问题与解决方案](#常见问题与解决方案)
+- [性能调优](#性能调优)
+
+## 概述
+
+### 什么是 FastExperienceMaker?
+
+FastExperienceMaker 是 LightRFT 中用于 RLHF(从人类反馈中强化学习)训练的优化经验生成引擎。它扩展了基础的 `NaiveExperienceMaker`,支持高性能推理后端(VLLM/SGLang)和高级强化学习特性。
+
+### 核心能力
+
+- **高性能推理**:支持 VLLM 和 SGLang 后端,实现高效文本生成
+- **多模态支持**:支持视觉-语言模型(VLM)的图像和视频数据处理
+- **高级优势估计**:支持多种方法,包括 GAE、RLOO、REINFORCE 和 Group Normalization
+- **灵活的奖励组合**:支持多个奖励模型和自定义聚合函数
+- **样本打包**:通过序列打包提高训练效率
+- **奖励归一化**:运行时奖励统计,支持归一化和裁剪
+
+## 核心功能
+
+### 1. 经验生成流程
+
+FastExperienceMaker 实现了一个 7 阶段的经验生成流程:
+
+```
+阶段 1: 样本生成 (VLLM/SGLang)
+    ↓
+阶段 2: 分片并行预处理
+    ↓
+阶段 3: 模型推理 (Actor, Critic, Initial, Reward Models)
+    ↓
+阶段 4: 分片并行后处理
+    ↓
+阶段 5: 奖励处理 (归一化、塑形、过滤)
+    ↓
+阶段 6: 多图像/视频处理
+    ↓
+阶段 7: 优势计算
+```
+
+### 2. 多模态数据处理
+
+`MultimodalDataProcessor` 类处理纯文本和图像-文本混合数据:
+
+- **自动分离**:分离纯文本和多模态样本
+- **适当处理**:通过分词器或多模态处理器路由样本
+- **顺序保持**:处理后保持原始批次顺序
+- **多图像/视频支持**:处理每个样本的多个图像或视频
+
+### 3. 奖励计算引擎
+
+`RewardComputationEngine` 管理奖励模型推理和聚合:
+
+- **远程奖励模型**:支持 HTTP/gRPC 奖励模型
+- **本地奖励模型**:基于 PyTorch 的奖励模型
+- **自定义奖励函数**:用于自定义奖励逻辑的 Python 函数
+- **多模型集成**:使用自定义聚合组合多个奖励模型
+- **优化批处理**:高效的批处理,支持可选的样本过滤
+
+## 架构组件
+
+### 类层次结构
+
+```
+NaiveExperienceMaker (基类)
+    ↓
+FastExperienceMaker
+    ├── MultimodalDataProcessor
+    ├── RewardComputationEngine
+    └── AdvantageCalculator (GAE/RLOO/REINFORCE/GroupNorm)
+```
+
+### 关键类
+
+#### 1. FastExperienceMaker
+
+**用途**:主要的经验生成类,具有优化推理和高级强化学习特性。
+
+**初始化参数**:
+- `packing_samples` (bool):启用样本打包以提高效率
+- `processor`:用于 VLM 模型的多模态处理器
+- 其他参数继承自 `NaiveExperienceMaker`
+
+**关键方法**:
+- `make_experience_list()`:从提示生成经验
+- `generate_samples()`:使用推理引擎生成样本
+- `get_advantages_and_returns()`:计算优势和回报
+
+#### 2. MultimodalDataProcessor
+
+**用途**:处理纯文本和多模态混合数据的预处理。
+
+**主要职责**:
+- 归一化图像/视频输入(文件路径、PIL 图像、字节)
+- 分离纯文本和多模态样本
+- 通过适当的流程处理
+- 按 `n_samples_per_prompt` 因子扩展样本
+
+#### 3. RewardComputationEngine
+
+**用途**:管理奖励模型推理和分数聚合。
+
+**处理流程**:
+1. **收集**:根据奖励配方收集或过滤样本
+2. **处理**:通过奖励模型运行前向传播
+3. **聚合**:使用 reward_fn 组合分数
+
+## 使用指南
+
+### 基础用法
+
+#### 纯文本生成
+
+```python
+from lightrft.trainer.fast_exp_maker import FastExperienceMaker
+
+# 初始化经验生成器
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=reward_model,
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    prompt_max_len=512,
+    kl_controller=kl_controller,
+    strategy=strategy,
+    packing_samples=False,
+)
+
+# 生成经验
+prompts = ["解释量子计算", "什么是机器学习?"]
+experiences = exp_maker.make_experience_list(
+    all_prompts=prompts,
+    temperature=0.7,
+    max_new_tokens=512,
+    top_p=0.9,
+)
+```
+
+#### 视觉-语言生成
+
+```python
+from PIL import Image
+
+# 使用处理器初始化以支持 VLM
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=reward_model,
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    processor=multimodal_processor,  # VLM 必需
+    prompt_max_len=512,
+    kl_controller=kl_controller,
+    strategy=strategy,
+)
+
+# 准备多模态数据
+prompts = ["描述这张图片", "图片里有什么?"]
+images = [
+    [Image.open("image1.jpg")],  # 单张图片
+    [Image.open("img2.jpg"), Image.open("img3.jpg")],  # 多张图片
+]
+references = ["沙发上的一只猫", "两只狗在玩耍"]
+
+# 生成经验
+experiences = exp_maker.make_experience_list(
+    all_prompts=prompts,
+    all_images=images,
+    all_references=references,
+    temperature=0.7,
+    max_new_tokens=512,
+)
+```
+
+### 高级用法
+
+#### 多奖励模型与自定义聚合
+
+```python
+import torch
+
+# 定义自定义奖励聚合函数
+def custom_reward_fn(model_reward_list, labels, queries, refs, label_map):
+    """
+    自定义奖励聚合函数。
+
+    参数:
+        model_reward_list: 每个模型的奖励张量列表
+        labels: 样本标签
+        queries: 生成的文本
+        refs: 参考文本
+        label_map: 奖励模型名称到索引的映射
+
+    返回:
+        aggregated_rewards: 组合的奖励张量
+        reward_metrics: 详细指标字典
+    """
+    # 示例:对两个奖励模型做固定权重的加权平均(此处未使用 labels)
+    weights = torch.tensor([0.6, 0.4])  # 两个模型的权重
+    aggregated = sum(w * r for w, r in zip(weights, model_reward_list))
+
+    metrics = {
+        "reward_model_1": model_reward_list[0].mean(),
+        "reward_model_2": model_reward_list[1].mean(),
+    }
+
+    return aggregated, metrics
+
+# 使用多个奖励模型初始化
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=[reward_model_1, reward_model_2],  # 模型列表
+    reward_fn=custom_reward_fn,
+    reward_fn_label_map={"rm1": 0, "rm2": 1},
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    strategy=strategy,
+)
+```
+
+#### 样本打包以提高效率
+
+```python
+# 启用样本打包
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=reward_model,
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    strategy=strategy,
+    packing_samples=True,  # 启用打包
+)
+
+# 打包格式:| prompt1 response1 [EOS] | prompt2 response2 [EOS] | ...
+# 优势:
+# - 减少填充开销
+# - 提高 GPU 利用率
+# - 更快的训练吞吐量
+```
+
+#### 远程奖励模型
+
+```python
+# 通过 HTTP/gRPC 使用远程奖励模型
+exp_maker = FastExperienceMaker(
+    actor=actor_model,
+    critic=critic_model,
+    reward_model=None,  # 无本地奖励模型
+    remote_rm_url=[
+        "http://reward-server-1:8000/score",
+        "http://reward-server-2:8000/score",
+    ],
+    initial_model=initial_model,
+    tokenizer=tokenizer,
+    strategy=strategy,
+)
+```
+
+## 配置参数
+
+### 生成参数
+
+| 参数 | 类型 | 默认值 | 描述 |
+|------|------|--------|------|
+| `temperature` | float | 1.0 | 采样温度(越高越随机) |
+| `top_p` | float | 1.0 | 核采样阈值 |
+| `top_k` | int | -1 | Top-k 采样(-1 = 禁用) |
+| `max_new_tokens` | int | 1024 | 生成的最大令牌数 |
+| `min_new_tokens` | int | 1 | 生成的最小令牌数 |
+| `skip_special_tokens` | bool | False | 输出中跳过特殊令牌 |
+
+### 奖励处理参数
+
+| 参数 | 类型 | 默认值 | 描述 |
+|------|------|--------|------|
+| `reward_running_norm` | bool | False | 启用运行时奖励归一化 |
+| `reward_running_norm_minus_mean` | bool | False | 归一化时减去均值 |
+| `reward_clip` | float | 0 | 奖励裁剪阈值(0 = 禁用) |
+| `overlong_buffer` | bool | False | 启用过长序列惩罚 |
+| `overlong_buffer_len` | int | 50 | 过长惩罚的缓冲区长度 |
+| `overlong_buffer_penalty_factor` | float | 1.0 | 过长序列的惩罚因子 |
+
+### 优势估计参数
+
+| 参数 | 类型 | 默认值 | 描述 |
+|------|------|--------|------|
+| `advantage_estimator` | str | "gae" | 方法:"gae"、"rloo"、"reinforce"、"group_norm" |
+| `advantages_norm` | bool | False | 启用优势归一化(白化) |
+| `advantage_clip` | float | 0 | 优势裁剪阈值(0 = 禁用) |
+| `gamma` | float | 1.0 | 回报的折扣因子 |
+| `lambd` | float | 0.95 | GAE lambda 参数 |
+
+## 优势估计方法
+
+### 1. GAE(广义优势估计)
+
+**何时使用**:标准 PPO 训练,需要 critic 模型。
+
+**优势**:
+- 通过 lambda 参数平衡偏差-方差权衡
+- 平滑的优势估计
+- 与价值函数配合良好
+
+**配置**:
+```python
+strategy.config.advantage_estimator = "gae"
+strategy.config.gamma = 1.0
+strategy.config.lambd = 0.95
+```
+
+### 2. RLOO(REINFORCE Leave-One-Out)
+
+**何时使用**:无 critic 模型训练,每个提示生成多个样本。
+
+**优势**:
+- 不需要 critic 模型
+- 通过基线减法减少方差
+- 对每个提示的多个样本高效
+
+**配置**:
+```python
+strategy.config.advantage_estimator = "rloo"
+strategy.config.n_samples_per_prompt = 4  # 必须 > 1
+```
+
+### 3. REINFORCE with Baseline
+
+**何时使用**:简单的策略梯度,使用奖励基线。
+
+**优势**:
+- 简单直接
+- 适用于每个提示的单个样本
+- 不需要 critic 模型
+
+**配置**:
+```python
+strategy.config.advantage_estimator = "reinforce"
+```
+
+### 4. Group Normalization (GRPO)
+
+**何时使用**:基于组的优势归一化,每个提示生成多个样本。
+
+**优势**:
+- 在每个提示组内归一化优势
+- 减少不同提示之间的方差
+- 对多样化的提示分布有效
+
+**配置**:
+```python
+strategy.config.advantage_estimator = "group_norm"
+strategy.config.n_samples_per_prompt = 4  # 必须 > 1
+```
+
+### 方法对比表
+
+| 方法 | 需要 Critic | 每个提示的样本数 | 方差 | 偏差 | 复杂度 |
+|------|------------|----------------|------|------|--------|
+| GAE | 是 | 任意 | 低 | 低 | 中等 |
+| RLOO | 否 | > 1 | 中等 | 低 | 低 |
+| REINFORCE | 否 | 任意 | 高 | 低 | 低 |
+| Group Norm | 否 | > 1 | 中等 | 中等 | 低 |
+
+## 最佳实践
+
+### 1. 选择推理后端
+
+**VLLM**:
+- ✅ 最适合:大规模部署、高吞吐量
+- ✅ 支持:PagedAttention、连续批处理
+- ⚠️ 注意:需要 CUDA 兼容的 GPU
+
+**SGLang**:
+- ✅ 最适合:研究、灵活性
+- ✅ 支持:自定义采样、结构化生成
+- ⚠️ 注意:可能有不同的性能特征
+
+### 2. 多模态数据处理
+
+**图像归一化**:
+```python
+# 支持的格式:
+# 1. PIL Image 对象
+images = [[Image.open("img.jpg")]]
+
+# 2. 文件路径(将自动加载)
+images = [["path/to/image.jpg"]]
+
+# 3. 混合格式
+images = [[Image.open("img1.jpg"), "path/to/img2.jpg"]]
+```
+
+**多图像场景**:
+```python
+# 每个样本多张图片
+images = [
+    [img1, img2, img3],  # 样本 1:3 张图片
+    [img4],  # 样本 2:1 张图片
+    None,  # 样本 3:无图片(纯文本)
+]
+```
+
+### 3. 奖励模型配置
+
+**单个奖励模型**:
+```python
+exp_maker = FastExperienceMaker(
+    reward_model=single_rm,
+    # ...
+)
+```
+
+**多个奖励模型与聚合**:
+```python
+exp_maker = FastExperienceMaker(
+    reward_model=[rm1, rm2, rm3],
+    reward_fn=custom_aggregation_fn,
+    reward_fn_label_map={"quality": 0, "safety": 1, "helpfulness": 2},
+    # ...
+)
+```
+
+**自定义奖励函数**:
+```python
+def custom_reward(queries, prompts, labels):
+    """自定义奖励计算逻辑"""
+    rewards = []
+    for query, prompt, label in zip(queries, prompts, labels):
+        # 你的自定义逻辑
+        score = compute_custom_score(query, prompt, label)
+        rewards.append(score)
+    return rewards
+
+exp_maker = FastExperienceMaker(
+    custom_reward_func=custom_reward,
+    # ...
+)
+```
+
+### 4. 内存优化
+
+**启用样本打包**:
+```python
+# 减少 30-50% 的填充开销
+exp_maker = FastExperienceMaker(
+    packing_samples=True,
+    # ...
+)
+```
+
+**调整微批次大小**:
+```python
+# 平衡内存使用和吞吐量
+strategy.config.micro_rollout_batch_size = 8  # 根据 GPU 内存调整
+```
+
+**梯度检查点**:
+```python
+# 为大型模型启用
+actor.gradient_checkpointing_enable()
+```
+
+### 5. 奖励归一化策略
+
+**运行时归一化**(推荐用于稳定训练):
+```python
+strategy.args.reward_running_norm = True
+strategy.args.reward_running_norm_minus_mean = True  # 减去均值
+```
+
+**奖励裁剪**(防止异常值):
+```python
+strategy.config.reward_clip = 10.0  # 裁剪到 [-10, 10]
+```
+
+**优势归一化**(稳定策略更新):
+```python
+strategy.config.advantages_norm = True
+strategy.config.advantage_clip = 5.0  # 可选裁剪
+```
+
+### 6. 处理过长序列
+
+```python
+# 惩罚过长的序列
+strategy.config.overlong_buffer = True
+strategy.config.overlong_buffer_len = 50  # 缓冲区长度
+strategy.config.overlong_buffer_penalty_factor = 1.0  # 惩罚强度
+
+# 示例:如果 max_new_tokens=512 且 buffer_len=50
+# 预期长度 = 512 - 50 = 462
+# 长度超过 462 个令牌的序列将受到惩罚
+```
+
+## 常见问题与解决方案
+
+### 问题 1:内存不足(OOM)
+
+**症状**:经验生成过程中出现 CUDA 内存不足错误。
+
+**解决方案**:
+1. 启用样本打包:`packing_samples=True`
+2. 减小微批次大小:`strategy.config.micro_rollout_batch_size = 4`
+3. 启用梯度检查点:`actor.gradient_checkpointing_enable()`
+4. 减少最大序列长度:`max_new_tokens=256`
+5. 使用更小的模型或量化
+
+### 问题 2:生成速度慢
+
+**症状**:经验生成耗时过长。
+
+**解决方案**:
+1. 使用 VLLM 后端:`strategy.args.engine_type = "vllm"`
+2. 增加批次大小:`strategy.config.micro_rollout_batch_size = 16`
+3. 启用样本打包:`packing_samples=True`
+4. 检查 GPU 利用率:确保 GPU 充分利用
+5. 如果可能,减少 `max_new_tokens`
+
+### 问题 3:训练不稳定
+
+**症状**:训练过程中奖励或损失剧烈波动。
+
+**解决方案**:
+1. 启用奖励归一化:
+   ```python
+   strategy.args.reward_running_norm = True
+   strategy.args.reward_running_norm_minus_mean = True
+   ```
+2. 启用优势归一化:
+   ```python
+   strategy.config.advantages_norm = True
+   ```
+3. 添加奖励裁剪:
+   ```python
+   strategy.config.reward_clip = 10.0
+   ```
+4. 降低学习率
+5. 使用 GAE 并设置适当的 lambda:`strategy.config.lambd = 0.95`
+
+### 问题 4:图像令牌不匹配(VLM)
+
+**症状**:推理过程中出现令牌/补丁不匹配的警告消息。
+
+**原因**:图像令牌数量与像素值补丁不匹配。
+
+**解决方案**:FastExperienceMaker 会自动修复此问题。警告仅供参考。如果频繁出现:
+1. 检查图像预处理流程
+2. 验证处理器配置
+3. 确保样本之间的图像格式一致
+
+### 问题 5:RLOO 需要多个样本
+
+**症状**:使用 RLOO 时 `n_samples_per_prompt = 1` 导致错误。
+
+**原因**:RLOO 需要每个提示生成多个样本来计算基线。
+
+**解决方案**:
+```python
+# 设置 n_samples_per_prompt > 1
+strategy.config.n_samples_per_prompt = 4
+strategy.config.advantage_estimator = "rloo"
+```
+
+或切换到其他方法:
+```python
+# 改用 GAE 或 REINFORCE
+strategy.config.advantage_estimator = "gae"
+```
+
+### 问题 6:远程奖励模型超时
+
+**症状**:使用远程奖励模型时出现超时错误。
+
+**解决方案**:
+1. 检查到奖励模型服务器的网络连接
+2. 在 remote_rm_fn 配置中增加超时时间
+3. 减小批次大小以避免长时间处理
+4. 考虑使用本地奖励模型以获得更好的性能
+5. 在自定义奖励函数中实现重试逻辑
+
+## 性能调优
+
+### 吞吐量优化
+
+**最大吞吐量的推荐配置**:
+```python
+strategy.args.engine_type = "vllm"
+strategy.config.micro_rollout_batch_size = 16  # 根据 GPU 内存调整
+exp_maker = FastExperienceMaker(
+    packing_samples=True,
+    # ...
+)
+```
+
+**预期性能**:
+- VLLM 后端:比 HuggingFace generate 快 2-5 倍
+- 样本打包:减少 30-50% 的填充开销
+- 批处理:批次大小线性扩展(直到 GPU 内存限制)
+
+### 内存效率
+
+**内存受限环境的推荐配置**:
+```python
+strategy.config.micro_rollout_batch_size = 4
+strategy.config.max_new_tokens = 256
+exp_maker = FastExperienceMaker(
+    packing_samples=True,
+    # ...
+)
+actor.gradient_checkpointing_enable()
+```
+
+## 参考资料
+
+### 相关文档
+- [策略设计哲学](strategy_design_philosophy.md)
+- [模型设计文档](model.md)
+- [奖励模型最佳实践](reward_model.md)
+
+### 代码引用
+- FastExperienceMaker:`lightrft/trainer/fast_exp_maker.py`
+- 基础 ExperienceMaker:`lightrft/trainer/experience_maker.py`
+- 优势计算器:`lightrft/trainer/advantage_calculator.py`
+- VLLM 工具:`lightrft/strategy/vllm_utils/`
+
+### 研究论文
+- **GAE**:"High-Dimensional Continuous Control Using Generalized Advantage Estimation"(Schulman 等,2016)
+- **PPO**:"Proximal Policy Optimization Algorithms"(Schulman 等,2017)
+- **RLOO**:"Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs"(Ahmadian 等,2024)
+
+---
+
+**文档版本**:1.0
+**最后更新**:2026-02-03
+**维护者**:LightRFT 团队