Minimal training repo for on-policy distillation experiments built on top of verl.
This repository is related to the following papers:
- TIP: Token Importance in On-Policy Distillation (PDF)
  - Studies which token positions carry the most useful learning signal in OPD.
  - Introduces the TIP view of token importance based on student entropy and teacher-student divergence.
- PACED: Distillation and On-Policy Self-Distillation at the Frontier of Student Competence (PDF)
  - Studies sample importance for distillation and self-distillation at the problem level.
  - Proposes weighting problems by student empirical pass rate, emphasizing the frontier of student competence.
  - A two-stage forward-then-reverse KL schedule leads to the best performance.
- Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training (PDF)
  - Use RL on a strong teacher model to explore high-reward reasoning behaviors.
  - Distill the RL-trained teacher into a smaller student with dense token-level supervision (FKL-OPD two-stage pipeline).
  - This teacher-RL + distillation setup outperforms directly training small models with GRPO/RL.
A separate (typically bigger) teacher model and a trainable student model see the same input sequences. The teacher produces better distributions naturally; no ground-truth injection is needed.
- Entry point: python -m opd.main_opd
- Requires the TEACHER_MODEL_PATH environment variable
- Batch construction:
  - build_opd_batch (trainer entry point) prefers pre-tokenized batch["prompts"] + response_mask so training matches rollout inputs; it falls back to raw_prompt + chat template only if prompts are absent
  - build_opd_batch_multiturn / build_opd_batch_from_verl_batch remain as thin aliases for the prompts-only and raw-prompt-only paths
- Supports reward-weighted distillation via the opd.reward_beta config
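The prompt-selection preference described above can be sketched as follows. This is a minimal illustration, not the repository's build_opd_batch: the function name select_prompt_ids and the tokenize_with_chat_template callable are hypothetical, and the batch is modeled as a plain dict of lists rather than a tensor batch.

```python
from typing import Callable, Dict, List


def select_prompt_ids(
    batch: Dict[str, list],
    tokenize_with_chat_template: Callable[[object], List[int]],
) -> List[List[int]]:
    """Return prompt token IDs, preferring the pre-tokenized rollout prompts.

    Hypothetical helper: it mirrors the documented preference order only.
    """
    prompts = batch.get("prompts")
    if prompts is not None:
        # Reuse exactly the token IDs the rollout engine consumed.
        return [list(p) for p in prompts]
    # Fallback: re-tokenize the raw prompts through a chat template.
    return [tokenize_with_chat_template(raw) for raw in batch["raw_prompt"]]
```

Reusing the rollout's token IDs avoids subtle mismatches that re-applying a chat template can introduce, which is presumably why the pre-tokenized path is preferred.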
OPD supports multi-turn agent-loop rollouts where the response contains interleaved LLM-generated tokens and tool/environment tokens:
- The trainer preserves the agent-loop response_mask (1=LLM, 0=tool) instead of recomputing it
- The batch builder uses response_mask as the per-token loss mask so distillation only targets LLM-generated spans
- build_opd_batch uses pre-tokenized prompt IDs from batch["prompts"] when present for exact prompt matching
Multi-turn diagnostics are logged: tool_mask/llm_tokens, tool_mask/tool_tokens, tool_mask/tool_ratio, num_turns/*.
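As a rough illustration of how a response_mask can act as the loss mask and feed the tool_mask diagnostics, consider the sketch below. The helper names are hypothetical, the inputs are plain Python lists for a single response, and the tool_ratio definition (tool tokens divided by all response tokens, with no padding) is an assumption rather than the repository's exact formula.

```python
from typing import Dict, List


def masked_mean(per_token_loss: List[float], response_mask: List[int]) -> float:
    """Average the loss over LLM-generated tokens only (mask value 1)."""
    kept = sum(response_mask)
    if kept == 0:
        return 0.0
    return sum(l * m for l, m in zip(per_token_loss, response_mask)) / kept


def tool_mask_stats(response_mask: List[int]) -> Dict[str, float]:
    """Count LLM vs. tool tokens; assumes the mask contains no padding."""
    total = len(response_mask)
    llm_tokens = sum(response_mask)
    tool_tokens = total - llm_tokens
    return {
        "llm_tokens": llm_tokens,
        "tool_tokens": tool_tokens,
        # Assumed definition: fraction of response tokens that came from tools.
        "tool_ratio": tool_tokens / total if total else 0.0,
    }
```

With this masking, tool or environment tokens contribute nothing to the distillation loss, however large their per-token divergence is.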
scripts/
  eval/
  grpo/
  opd/       # OPD training scripts (separate teacher)
  utils/
src/
  common/    # Shared batch builder
  data/
  opd/       # OPD module (separate teacher model)
  rewards/
The scripts assume a GPU machine with:
- Python 3
- CUDA and nvidia-smi
- verl
- torch
- transformers
- ray
- hydra
- tensordict
The setup scripts under scripts/*/setup_*.sh only do lightweight verification plus pip install tensordict; they do not create a full environment from scratch.
The current testing environment is:
verl 0.7.0.7
torch 2.9.1.7
transformers 4.57.1
torchao 0.9.0
torchaudio 2.9.1.1
torchvision 0.24.1.10
By default, training and eval scripts look for data under:
<repo>/data
Expected raw inputs:
data/
DAPO-Math-17k-dedup/distinct-prompts-with-rewards.parquet
AIME_2024/aime_2024_problems.parquet
AIME_2025/train.jsonl
MATH-500/test.jsonl
Generated files:
- data/grpo_processed/*.parquet from src/data/prepare_grpo_data.py
- data/eval_processed/<variant>/*.parquet from src/data/process_eval_data.py
The training code uses several mechanisms to keep memory usage manageable on long-context math runs:
- FSDP parameter and optimizer offload. The launch scripts enable actor.fsdp_config.param_offload=True, actor.fsdp_config.optimizer_offload=True, and ref.fsdp_config.param_offload=True so model weights and optimizer state can be moved off GPU when inactive.
- Remove-padding execution. Training scripts set actor_rollout_ref.model.use_remove_padding=True, and the OPD worker uses unpadded sequence paths so compute and memory scale with real token count instead of padded sequence length.
- Two-phase teacher/student execution for distillation. OPD does not keep both teacher and student workloads active on GPU at the same time. The worker first runs teacher-side computation, moves cached teacher statistics or logits to CPU, offloads the teacher, and only then runs the student update step.
- Chunked divergence computation. OPD divergence losses in src/opd/losses.py process tokens in chunks instead of materializing full-vocabulary probability tensors for the whole batch at once.
- Micro-batching in the worker. OPD splits batches using ppo_micro_batch_size_per_gpu and accumulates gradients across micro-batches to bound activation and logits memory.
- Dynamic batch sizing for GRPO. The main GRPO script enables actor.use_dynamic_bsz and caps per-GPU token counts with ppo_max_token_len_per_gpu and log_prob_max_token_len_per_gpu, which is useful when response lengths vary widely.
- Rollout memory controls. The scripts enable rollout.free_cache_engine=True and expose GPU_MEMORY_UTIL so KV-cache usage can be bounded during generation.
In practice, the biggest repo-specific savings come from the OPD two-phase worker design, chunked loss computation, and remove-padding execution.
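The chunked loss idea can be illustrated with a small pure-Python sketch. It is not the code in src/opd/losses.py: the function names are hypothetical, logits are nested lists of shape [tokens][vocab], and reverse KL is used only as an example divergence. The point is that probabilities are computed for one chunk of tokens at a time while the final mean is unchanged.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl_token(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) for a single token position."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def chunked_reverse_kl(
    student_logits: List[List[float]],
    teacher_logits: List[List[float]],
    chunk_size: int,
) -> float:
    """Mean per-token reverse KL, processing chunk_size tokens at a time."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n = len(student_logits)
    total = 0.0
    for start in range(0, n, chunk_size):
        end = start + chunk_size
        # Only this slice's vocabulary-sized distributions exist at once.
        chunk = [
            reverse_kl_token(s, t)
            for s, t in zip(student_logits[start:end], teacher_logits[start:end])
        ]
        total += sum(chunk)
    return total / n if n else 0.0
```

Because the per-token terms are simply summed, the chunk size trades peak memory against the number of passes without changing the loss value beyond floating-point rounding.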
OPD (src/opd/opd_worker.py) uses a two-phase update:
- Phase 1 (Teacher): Load the teacher (ref) model, run teacher forwards for all micro-batches, cache teacher logits on CPU, offload teacher.
- Phase 2 (Student): Load the student (actor) model and optimizer, run student forward + divergence loss + backward using cached teacher logits.
This avoids keeping both teacher and student compute active on GPU at the same time during the update step.
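The ordering of the two phases can be sketched with a toy stand-in for the models. Everything here is hypothetical (ToyModel, two_phase_update, and the teacher_forward and student_step callables); it only demonstrates the control flow, not the actual worker in src/opd/opd_worker.py.

```python
from typing import Callable, List, Sequence


class ToyModel:
    """Stand-in for a model handle that tracks whether it occupies the GPU."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.on_gpu = False

    def load(self) -> None:
        self.on_gpu = True

    def offload(self) -> None:
        self.on_gpu = False


def two_phase_update(
    teacher: ToyModel,
    student: ToyModel,
    micro_batches: Sequence[list],
    teacher_forward: Callable[[list], list],
    student_step: Callable[[list, list], float],
) -> List[float]:
    # Phase 1: teacher forwards for every micro-batch, results cached
    # (treated here as CPU-resident), then the teacher is offloaded.
    teacher.load()
    cached = [teacher_forward(mb) for mb in micro_batches]
    teacher.offload()

    # Phase 2: the student consumes the cached teacher outputs.
    student.load()
    losses = []
    for mb, teacher_out in zip(micro_batches, cached):
        assert not teacher.on_gpu  # teacher and student never overlap on GPU
        losses.append(student_step(mb, teacher_out))
    return losses
```

The cost of this design is holding the cached teacher outputs in host memory for the whole batch, in exchange for never needing GPU room for both models at once.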
OPD supports three divergence types (reverse_kl, forward_kl, jsd), chunk-wise loss computation, and per-sample reward weighting.
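For reference, the three divergence names can be read as in the sketch below. It assumes the convention common in distillation work, where forward KL is KL(teacher || student) and reverse KL is KL(student || teacher), and it assumes an equal-weight JSD; the repository's exact definitions, including any JSD mixing coefficient and the reward weighting, may differ. The function names are hypothetical.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two probability vectors over the same vocabulary."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def divergence(student: Sequence[float], teacher: Sequence[float], loss_type: str) -> float:
    """Per-token divergence between student and teacher distributions."""
    if loss_type == "reverse_kl":
        # Expectation under the student: mode-seeking.
        return kl(student, teacher)
    if loss_type == "forward_kl":
        # Expectation under the teacher: mass-covering.
        return kl(teacher, student)
    if loss_type == "jsd":
        # Assumed equal-weight mixture of the two distributions.
        mix = [0.5 * (s + t) for s, t in zip(student, teacher)]
        return 0.5 * kl(student, mix) + 0.5 * kl(teacher, mix)
    raise ValueError(f"unknown loss_type: {loss_type}")
```

Under these definitions, the reverse KL of (student, teacher) equals the forward KL with the arguments swapped, and the JSD is symmetric and bounded by ln 2.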
GRPO:

bash scripts/grpo/setup_grpo.sh
MODEL_PATH=/path/to/model \
MODEL_NAME=my-model \
bash scripts/grpo/train_grpo.sh

Native GRPO with KL:

MODEL_PATH=/path/to/model \
MODEL_NAME=my-model \
bash scripts/grpo/train_grpo_native.sh

Native GRPO without KL:

MODEL_PATH=/path/to/model \
MODEL_NAME=my-model \
bash scripts/grpo/train_grpo_native_no_kl.sh

OPD (separate teacher, single-turn math):

bash scripts/opd/setup_opd.sh
MODEL_PATH=/path/to/student_model \
TEACHER_MODEL_PATH=/path/to/teacher_model \
MODEL_NAME=my-model \
bash scripts/opd/train_opd.sh

OPD (separate teacher, multi-turn agent with tool calls):

bash scripts/opd/setup_opd.sh
MODEL_PATH=/path/to/student_model \
TEACHER_MODEL_PATH=/path/to/teacher_model \
DATABASE_DIR=/path/to/tool/database \
MODEL_NAME=my-model \
bash scripts/opd/train_opd_agent.sh

Evaluation:

MODEL_PATH=/path/to/model \
MODEL_NAME=my-model \
INSTRUCTION_VARIANT=boxed \
REWARD_FUNCTION=math_reward \
bash scripts/eval/eval_math.sh

Checkpoint conversion:

CHECKPOINT_PATH=/path/to/global_step_54/actor \
bash scripts/utils/convert_checkpoint.sh

Most training scripts accept overrides through environment variables, including:
- MODEL_PATH
- MODEL_NAME
- DATA_DIR
- TRAIN_BATCH_SIZE
- PPO_MINI_BATCH_SIZE
- PPO_MICRO_BATCH_SIZE_PER_GPU
- LEARNING_RATE
- TOTAL_EPOCHS
- MAX_PROMPT_LENGTH
- MAX_RESPONSE_LENGTH
- ROLLOUT_N
- TP_SIZE
- GPU_MEMORY_UTIL

OPD-specific variables:
- TEACHER_MODEL_PATH (required)
- OPD_LOSS_TYPE
- OPD_CHUNK_SIZE
- OPD_MAX_LENGTH
- OPD_REWARD_BETA
- ENABLE_THINKING

OPD agent additional variables:
- ENABLE_TOOLS
- MAX_ASSISTANT_TURNS
- MAX_TOOL_RESPONSE_LENGTH
- TOOL_FORMAT
- AGENT_NUM_WORKERS
- DATABASE_DIR

Eval variables:
- INSTRUCTION_VARIANT
- REWARD_FUNCTION
- VAL_TEMPERATURE
- VAL_TOP_P
- VAL_TOP_K
- VAL_N