CRISP: Compressed Reasoning via Iterative Self-Policy Distillation (Original OPSDC On-Policy Self-Distillation for Reasoning Compression)

This repository contains the code for CRISP (Compressed Reasoning via Iterative Self-Policy Distillation), a method that teaches reasoning models to think more concisely by distilling their own concise behavior back into themselves.

Paper: CRISP: Compressed Reasoning via Iterative Self-Policy Distillation | arXiv

Authors: Hejian Sang*, Yuanda Xu*, Zhengze Zhou*, Ran He*, Zhipeng Wang, Jiachen Sun

Related write-up: Scorer Choice in Math Reasoning Evaluation — a four-policy decomposition of how verifier choice (answer-extraction vs. symbolic equivalence) can swing reported MATH-500 accuracy by up to ~80 percentage points on identical generations.

Key Idea

Reasoning models think out loud, but much of what they say is noise. CRISP uses a single, almost trivial idea: ask the model to be concise, then teach it to do so without being asked.

Teacher: The same model conditioned on a conciseness instruction (e.g., "Solve concisely, avoid unnecessary steps")
Student: The same model without the conciseness instruction

Training generates student rollouts and minimizes per-token reverse KL divergence between student and teacher distributions. No ground-truth answers, no token budgets, no difficulty estimators.

Results

Model	Benchmark	Token Reduction	Accuracy Change
Qwen3-8B	MATH-500	59%	+9 pts (77% → 86%)
Qwen3-14B	MATH-500	57%	+16 pts (70% → 86%)
Qwen3-14B	AIME 2024	41%	+10 pts

Compression naturally adapts to problem difficulty (~1.6x more compression on easy vs. hard problems), entropy remains stable throughout training, and general capabilities (MMLU) are fully preserved.

Repository Structure

OnPolicySD-open/
├── verl/                          # VERL framework (forked, with minor fixes)
├── workspace/
│   ├── config/
│   │   └── prompts.json           # Prompt templates (student, teacher, length prune)
│   ├── data/
│   │   ├── DAPO-Math-17k-dedup/   # Training data (17k math problems)
│   │   ├── MATH-500/              # Validation benchmark
│   │   ├── aime24/                # AIME 2024 validation
│   │   └── aime25/                # AIME 2025 validation
│   ├── src/
│   │   ├── data/
│   │   │   ├── process_eval_data.py          # Process eval datasets (train/val splits)
│   │   │   ├── prepare_length_prune_data.py  # Generate length pruning prompts
│   │   │   └── prepare_self_distill_data.py  # Generate self-distill prompts (with teacher solutions)
│   │   └── self_distill_hybrid/
│   │       ├── main_opsd.py       # OPSD entry point
│   │       ├── opsd_trainer.py    # OPSD trainer (JSD/reverse-KL loss)
│   │       ├── opsd_worker.py     # OPSD FSDP worker
│   │       ├── sd_worker.py       # Base self-distill worker
│   │       ├── sd_dataset.py      # Dataset for paired teacher/student prompts
│   │       └── sd_verifier.py     # Math answer verification
│   ├── scripts/sft/
│   │   └── train_opsd.sh          # Main training launch script
│   └── execution-configs/         # Hyperparameter configs for Qwen3-8B and 14B

Setup

Prerequisites

8x H100/H200 GPUs (80GB)
Python 3.10+
CUDA 12.4+

Installation

git clone https://github.com/HJSang/OPSD_Reasoning_Compression.git
cd OPSD_Reasoning_Compression

# Install VERL and dependencies
cd verl
pip install -e .
cd ..

# Install additional dependencies
pip install sglang pandas datasets hydra-core omegaconf

Quick Start

The full pipeline has 3 stages:

Stage 1: Process Evaluation Data

Process DAPO-Math-17k-dedup into train/val splits and prepare validation benchmarks (MATH-500, AIME 2024, AIME 2025).

cd workspace/src/data

python process_eval_data.py \
    --data_dir ../../data \
    --output_dir ../../data/processed

This produces:

data/processed/train.parquet — DAPO training split (95%)
data/processed/val_dapo.parquet — DAPO validation split (5%)
data/processed/val_math500.parquet, val_aime24.parquet, val_aime25.parquet — Evaluation benchmarks

Stage 2: Generate Length Pruning Prompts

Create paired teacher/student prompts for OPSD training. The teacher prompt adds a conciseness instruction; the student prompt is the original DAPO-Math prompt unchanged.

# Batch mode (recommended) — generates all 4 variants with shared 80/20 split:
python prepare_length_prune_data.py batch \
    --input-parquet ../../data/DAPO-Math-17k-dedup/distinct-prompts-with-rewards.parquet \
    --output-root ../../data

# This creates:
#   data/length_prune_concise/     — "Solve concisely" teacher prompt
#   data/length_prune_20pct/       — "Use 20% fewer tokens" teacher prompt
#   data/length_prune_50pct/       — "Use 50% fewer tokens" teacher prompt
#   data/length_prune_80pct/       — "Use 80% fewer tokens" teacher prompt
#
# Each directory contains:
#   self_distill_prompts.parquet       — Training prompts
#   self_distill_prompts_val.parquet   — Validation prompts

Stage 3: Train OPSD

Launch OPSD training using the VERL HybridEngine (sglang for generation + FSDP for training).

Qwen3-8B

MODEL_PATH=/path/to/Qwen3-8B \
SD_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts.parquet \
SD_VAL_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts_val.parquet \
OPSD_BETA=0.5 \
SD_TEMPERATURE=1.0 \
SD_TOP_P=1.0 \
SD_MAX_TOKENS=8192 \
SFT_MAX_LENGTH=10240 \
TOTAL_EPOCHS=1 \
TRAIN_BATCH_SIZE=32 \
MICRO_BATCH_SIZE=2 \
LEARNING_RATE=1e-6 \
TP_SIZE=2 \
GPU_MEM_UTIL=0.75 \
ULYSSES_SP_SIZE=4 \
MAX_PROMPT_LENGTH=1024 \
MAX_RESPONSE_LENGTH=30000 \
VAL_MAX_TOKENS=30000 \
CHECK_STRUCTURE=false \
USE_LIGER=true \
OPSD_LOSS_TYPE=reverse_kl \
TEACHER_UPDATE_FREQ=50 \
EXPERIMENT_NAME=opsd_length_prune_concise \
bash workspace/scripts/sft/train_opsd.sh

Qwen3-14B

MODEL_PATH=/path/to/Qwen3-14B \
SD_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts.parquet \
SD_VAL_PROMPTS_PATH=./workspace/data/length_prune_concise/self_distill_prompts_val.parquet \
OPSD_BETA=0.5 \
SD_TEMPERATURE=1.0 \
SD_TOP_P=1.0 \
SD_MAX_TOKENS=8192 \
SFT_MAX_LENGTH=10240 \
TOTAL_EPOCHS=1 \
TRAIN_BATCH_SIZE=32 \
MICRO_BATCH_SIZE=2 \
LEARNING_RATE=1e-6 \
TP_SIZE=2 \
GPU_MEM_UTIL=0.75 \
ULYSSES_SP_SIZE=4 \
MAX_PROMPT_LENGTH=1024 \
MAX_RESPONSE_LENGTH=30000 \
VAL_MAX_TOKENS=30000 \
CHECK_STRUCTURE=false \
USE_LIGER=true \
OPSD_LOSS_TYPE=reverse_kl \
TEACHER_UPDATE_FREQ=50 \
EXPERIMENT_NAME=opsd_length_prune_concise \
bash workspace/scripts/sft/train_opsd.sh

Pre-configured hyperparameter files for various ablations (teacher update frequency, compression strength) are available in workspace/execution-configs/.

Key Hyperparameters

Parameter	Default	Description
`OPSD_LOSS_TYPE`	`reverse_kl`	Loss type: `reverse_kl` or `jsd`
`OPSD_BETA`	`0.5`	JSD interpolation weight (only used when `jsd`)
`TEACHER_UPDATE_FREQ`	`50`	Steps between teacher weight updates (0 = frozen teacher)
`SD_TEMPERATURE`	`1.0`	Student rollout temperature
`SD_MAX_TOKENS`	`8192`	Max tokens for student generation
`SFT_MAX_LENGTH`	`10240`	Max sequence length for training
`CHECK_STRUCTURE`	`false`	Whether to require `<think>` tags in responses
`USE_LIGER`	`true`	Memory-efficient loss via logsumexp

How It Works

Generate: sglang produces student responses from question-only prompts
Score: Teacher forward pass computes logits on student-generated tokens using the conciseness-augmented prompt
Train: Minimize per-token reverse KL between student and teacher distributions on ALL responses (no correctness filtering)
Sync: Updated weights are automatically synced back to sglang for the next generation step
Refresh teacher: Every TEACHER_UPDATE_FREQ steps, copy student weights to teacher for progressive compression

Acknowledgments

Built on top of VERL (HybridEngine for combined generation and training).

Citation

@article{sang2025crisp,
  title={CRISP: Compressed Reasoning via Iterative Self-Policy Distillation},
  author={Sang, Hejian and Xu, Yuanda and Zhou, Zhengze and He, Ran and Wang, Zhipeng and Sun, Jiachen},
  journal={arXiv preprint arXiv:2603.05433},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
workspace		workspace
README.md		README.md
crisp_compressed_reasoning_via_iterative_self_policy_distillation.pdf		crisp_compressed_reasoning_via_iterative_self_policy_distillation.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation (Original OPSDC On-Policy Self-Distillation for Reasoning Compression)

Key Idea

Results

Repository Structure

Setup

Prerequisites

Installation

Quick Start

Stage 1: Process Evaluation Data

Stage 2: Generate Length Pruning Prompts

Stage 3: Train OPSD

Qwen3-8B

Qwen3-14B

Key Hyperparameters

How It Works

Acknowledgments

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CRISP: Compressed Reasoning via Iterative Self-Policy Distillation (Original OPSDC On-Policy Self-Distillation for Reasoning Compression)

Key Idea

Results

Repository Structure

Setup

Prerequisites

Installation

Quick Start

Stage 1: Process Evaluation Data

Stage 2: Generate Length Pruning Prompts

Stage 3: Train OPSD

Qwen3-8B

Qwen3-14B

Key Hyperparameters

How It Works

Acknowledgments

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages