CRISP: Compressed Reasoning via Iterative Self-Policy Distillation (originally OPSDC: On-Policy Self-Distillation for Reasoning Compression)
This repository contains the code for CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), a method that teaches reasoning models to think more concisely by distilling their own concise behavior back into themselves.
Paper: CRISP: Compressed Reasoning via Iterative Self-Policy Distillation | arXiv
Authors: Hejian Sang*, Yuanda Xu*, Zhengze Zhou*, Ran He*, Zhipeng Wang, Jiachen Sun
Related write-up: Scorer Choice in Math Reasoning Evaluation — a four-policy decomposition of how verifier choice (answer-extraction vs. symbolic equivalence) can swing reported MATH-500 accuracy by up to ~80 percentage points on identical generations.
Reasoning models think out loud, but much of what they say is noise. CRISP uses a single, almost trivial idea: ask the model to be concise, then teach it to do so without being asked.
- Teacher: The same model conditioned on a conciseness instruction (e.g., "Solve concisely, avoid unnecessary steps")
- Student: The same model without the conciseness instruction
Training generates student rollouts and minimizes per-token reverse KL divergence between student and teacher distributions. No ground-truth answers, no token budgets, no difficulty estimators.
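The objective described above can be written out as a rough sketch. The notation is introduced here for illustration and is not taken from the paper: $x$ is the problem, $c$ the conciseness instruction, $\pi_\theta$ the student, $\pi_{\bar\theta}$ the teacher (the same model, conditioned on $c$), and $y$ a rollout sampled from the student. The $1/|y|$ length normalization is an assumption.

```latex
\mathcal{L}(\theta)
  = \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[
      \frac{1}{|y|} \sum_{t=1}^{|y|}
      D_{\mathrm{KL}}\!\left(
        \pi_\theta(\cdot \mid x,\, y_{<t})
        \;\middle\|\;
        \pi_{\bar\theta}(\cdot \mid c,\, x,\, y_{<t})
      \right)
    \right]
```

"Reverse" here means the student distribution is the first argument of the KL term, so the student is penalized for placing mass on tokens the concise teacher considers unlikely.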
| Model | Benchmark | Token Reduction | Accuracy Change |
|---|---|---|---|
| Qwen3-8B | MATH-500 | 59% | +9 pts (77% → 86%) |
| Qwen3-14B | MATH-500 | 57% | +16 pts (70% → 86%) |
| Qwen3-14B | AIME 2024 | 41% | +10 pts |
Compression naturally adapts to problem difficulty (~1.6x more compression on easy vs. hard problems), entropy remains stable throughout training, and general capabilities (MMLU) are fully preserved.
OnPolicySD-open/
├── verl/                                    # VERL framework (forked, with minor fixes)
├── workspace/
│   ├── config/
│   │   └── prompts.json                     # Prompt templates (student, teacher, length prune)
│   ├── data/
│   │   ├── DAPO-Math-17k-dedup/             # Training data (17k math problems)
│   │   ├── MATH-500/                        # Validation benchmark
│   │   ├── aime24/                          # AIME 2024 validation
│   │   └── aime25/                          # AIME 2025 validation
│   ├── src/
│   │   ├── data/
│   │   │   ├── process_eval_data.py         # Process eval datasets (train/val splits)
│   │   │   ├── prepare_length_prune_data.py # Generate length pruning prompts
│   │   │   └── prepare_self_distill_data.py # Generate self-distill prompts (with teacher solutions)
│   │   └── self_distill_hybrid/
│   │       ├── main_opsd.py                 # OPSD entry point
│   │       ├── opsd_trainer.py              # OPSD trainer (JSD/reverse-KL loss)
│   │       ├── opsd_worker.py               # OPSD FSDP worker
│   │       ├── sd_worker.py                 # Base self-distill worker
│   │       ├── sd_dataset.py                # Dataset for paired teacher/student prompts
│   │       └── sd_verifier.py               # Math answer verification
│   ├── scripts/sft/
│   │   └── train_opsd.sh                    # Main training launch script
│   └── execution-configs/                   # Hyperparameter configs for Qwen3-8B and 14B
- 8x H100/H200 GPUs (80GB)
- Python 3.10+
- CUDA 12.4+
git clone https://github.com/HJSang/OPSD_Reasoning_Compression.git
cd OPSD_Reasoning_Compression
# Install VERL and dependencies
cd verl
pip install -e .
cd ..
# Install additional dependencies
pip install sglang pandas datasets hydra-core omegaconf

The full pipeline has 3 stages:
Process DAPO-Math-17k-dedup into train/val splits and prepare validation benchmarks (MATH-500, AIME 2024, AIME 2025).
cd workspace/src/data
python process_eval_data.py \
--data_dir ../../data \
--output_dir ../../data/processed

This produces:
- data/processed/train.parquet: DAPO training split (95%)
- data/processed/val_dapo.parquet: DAPO validation split (5%)
- data/processed/val_math500.parquet, val_aime24.parquet, val_aime25.parquet: Evaluation benchmarks
Create paired teacher/student prompts for OPSD training. The teacher prompt adds a conciseness instruction; the student prompt is the original DAPO-Math prompt unchanged.
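As a rough illustration of the pairing described above, the sketch below builds one student/teacher prompt pair. `build_prompt_pair` and `PromptPair` are hypothetical names, and appending the instruction after the problem is an assumption; the templates the repository actually uses live in workspace/config/prompts.json and may be worded or placed differently.

```python
from dataclasses import dataclass

# Assumed wording, based on the example instruction quoted earlier in this README.
CONCISE_INSTRUCTION = "Solve concisely, avoid unnecessary steps."


@dataclass(frozen=True)
class PromptPair:
    student: str  # original problem text, unchanged
    teacher: str  # same problem plus the conciseness instruction


def build_prompt_pair(question: str, instruction: str = CONCISE_INSTRUCTION) -> PromptPair:
    """Return the student/teacher prompts for one problem (hypothetical helper)."""
    question = question.strip()
    # Assumption: the instruction is appended after the problem statement.
    teacher = f"{question}\n\n{instruction}"
    return PromptPair(student=question, teacher=teacher)


pair = build_prompt_pair("What is the sum of the first 10 positive integers?")
print(pair.student)
print(pair.teacher)
```

Both prompts share the same problem text, so the only difference the teacher sees is the added instruction.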
# Batch mode (recommended) — generates all 4 variants with shared 80/20 split:
python prepare_length_prune_data.py batch \
--input-parquet ../../data/DAPO-Math-17k-dedup/distinct-prompts-with-rewards.parquet \
--output-root ../../data
# This creates:
# data/length_prune_concise/ — "Solve concisely" teacher prompt
# data/length_prune_20pct/ — "Use 20% fewer tokens" teacher prompt
# data/length_prune_50pct/ — "Use 50% fewer tokens" teacher prompt
# data/length_prune_80pct/ — "Use 80% fewer tokens" teacher prompt
#
# Each directory contains:
# self_distill_prompts.parquet — Training prompts
# self_distill_prompts_val.parquet — Validation prompts

Launch OPSD training using the VERL HybridEngine (sglang for generation + FSDP for training).
MODEL_PATH=/path/to/Qwen3-8B \
SD_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts.parquet \
SD_VAL_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts_val.parquet \
OPSD_BETA=0.5 \
SD_TEMPERATURE=1.0 \
SD_TOP_P=1.0 \
SD_MAX_TOKENS=8192 \
SFT_MAX_LENGTH=10240 \
TOTAL_EPOCHS=1 \
TRAIN_BATCH_SIZE=32 \
MICRO_BATCH_SIZE=2 \
LEARNING_RATE=1e-6 \
TP_SIZE=2 \
GPU_MEM_UTIL=0.75 \
ULYSSES_SP_SIZE=4 \
MAX_PROMPT_LENGTH=1024 \
MAX_RESPONSE_LENGTH=30000 \
VAL_MAX_TOKENS=30000 \
CHECK_STRUCTURE=false \
USE_LIGER=true \
OPSD_LOSS_TYPE=reverse_kl \
TEACHER_UPDATE_FREQ=50 \
EXPERIMENT_NAME=opsd_length_prune_concise \
bash workspace/scripts/sft/train_opsd.sh

MODEL_PATH=/path/to/Qwen3-14B \
SD_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts.parquet \
SD_VAL_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts_val.parquet \
OPSD_BETA=0.5 \
SD_TEMPERATURE=1.0 \
SD_TOP_P=1.0 \
SD_MAX_TOKENS=8192 \
SFT_MAX_LENGTH=10240 \
TOTAL_EPOCHS=1 \
TRAIN_BATCH_SIZE=32 \
MICRO_BATCH_SIZE=2 \
LEARNING_RATE=1e-6 \
TP_SIZE=2 \
GPU_MEM_UTIL=0.75 \
ULYSSES_SP_SIZE=4 \
MAX_PROMPT_LENGTH=1024 \
MAX_RESPONSE_LENGTH=30000 \
VAL_MAX_TOKENS=30000 \
CHECK_STRUCTURE=false \
USE_LIGER=true \
OPSD_LOSS_TYPE=reverse_kl \
TEACHER_UPDATE_FREQ=50 \
EXPERIMENT_NAME=opsd_length_prune_concise \
bash workspace/scripts/sft/train_opsd.sh

Pre-configured hyperparameter files for various ablations (teacher update frequency, compression strength) are available in workspace/execution-configs/.
| Parameter | Default | Description |
|---|---|---|
| OPSD_LOSS_TYPE | reverse_kl | Loss type: reverse_kl or jsd |
| OPSD_BETA | 0.5 | JSD interpolation weight (only used when jsd) |
| TEACHER_UPDATE_FREQ | 50 | Steps between teacher weight updates (0 = frozen teacher) |
| SD_TEMPERATURE | 1.0 | Student rollout temperature |
| SD_MAX_TOKENS | 8192 | Max tokens for student generation |
| SFT_MAX_LENGTH | 10240 | Max sequence length for training |
| CHECK_STRUCTURE | false | Whether to require <think> tags in responses |
| USE_LIGER | true | Memory-efficient loss via logsumexp |
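To make the two OPSD_LOSS_TYPE options concrete, here is a minimal pure-Python sketch of both divergences for a single token position. It is not the repository's implementation. The generalized JSD below follows one common convention (the mixture weights the teacher by beta), which is an assumption about how OPSD_BETA is used; all function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax using the logsumexp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def kl(log_p, log_q):
    """KL(p || q) for two log-probability vectors over the same vocabulary."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher): the student distribution is the first argument."""
    return kl(log_softmax(student_logits), log_softmax(teacher_logits))


def beta_jsd(student_logits, teacher_logits, beta=0.5):
    """beta * KL(teacher || m) + (1 - beta) * KL(student || m),
    with mixture m = beta * teacher + (1 - beta) * student (assumed convention)."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    log_m = [
        math.log(beta * math.exp(lt) + (1 - beta) * math.exp(ls))
        for ls, lt in zip(log_s, log_t)
    ]
    return beta * kl(log_t, log_m) + (1 - beta) * kl(log_s, log_m)


# Toy logits over a 3-token vocabulary at one position.
student = [2.0, 0.5, -1.0]
teacher = [0.1, 0.1, 3.0]
print("reverse_kl:", reverse_kl(student, teacher))
print("jsd(beta=0.5):", beta_jsd(student, teacher, beta=0.5))
```

With beta = 0.5 this reduces to the symmetric Jensen-Shannon divergence, which is bounded above by ln 2, whereas reverse KL is unbounded.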
- Generate: sglang produces student responses from question-only prompts
- Score: Teacher forward pass computes logits on student-generated tokens using the conciseness-augmented prompt
- Train: Minimize per-token reverse KL between student and teacher distributions on ALL responses (no correctness filtering)
- Sync: Updated weights are automatically synced back to sglang for the next generation step
- Refresh teacher: Every TEACHER_UPDATE_FREQ steps, copy student weights to teacher for progressive compression
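The teacher refresh schedule in the last step can be sketched as follows. This only models when a refresh would fire; the generate, score, train, and sync work is elided. The function names are hypothetical, and counting steps from 1 with a refresh on exact multiples is an assumption about the trainer's behavior.

```python
def should_refresh_teacher(step: int, update_freq: int) -> bool:
    """True if the teacher should be overwritten with student weights after `step` (1-indexed)."""
    if update_freq <= 0:  # 0 means the teacher stays frozen
        return False
    return step % update_freq == 0


def run_schedule(total_steps: int, update_freq: int) -> list[int]:
    """Return the steps at which a teacher refresh would happen."""
    refresh_steps = []
    for step in range(1, total_steps + 1):
        # generate -> score -> train -> sync would happen here
        if should_refresh_teacher(step, update_freq):
            refresh_steps.append(step)
    return refresh_steps


print(run_schedule(200, 50))  # [50, 100, 150, 200]
print(run_schedule(200, 0))   # []
```

With the default TEACHER_UPDATE_FREQ=50, the teacher is a snapshot of the student that is at most 50 steps old, so each refresh asks an already-compressed model to be concise again.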
Built on top of VERL (HybridEngine for combined generation and training).
@article{sang2025crisp,
title={CRISP: Compressed Reasoning via Iterative Self-Policy Distillation},
author={Sang, Hejian and Xu, Yuanda and Zhou, Zhengze and He, Ran and Wang, Zhipeng and Sun, Jiachen},
journal={arXiv preprint arXiv:2603.05433},
year={2026}
}