Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), a simple yet effective off-policy algorithm that adapts this idea to LLMs by optimizing a single trajectory-level Bellman objective using the model's own logits as Q-values.
We propose Trajectory Bellman Residual Minimization (TBRM), a simple and effective value-based reinforcement learning (RL) algorithm tailored for large language models (LLMs). TBRM directly leverages the model logits to parameterize the Q-values, value function, and policy under a KL-regularized RL framework.
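Concretely, in the standard KL-regularized (soft) RL setup, the logits act as Q-values and the value function and policy follow in closed form. The notation below is ours, sketching the usual soft-RL identities rather than quoting the paper:

```latex
% Logits z_\theta as Q-values; V and \pi follow in closed form,
% with \beta the KL-regularization coefficient.
Q_\theta(s, a) = z_\theta(s, a), \qquad
V_\theta(s) = \beta \log \sum_{a'} \exp\!\big(Q_\theta(s, a') / \beta\big), \qquad
\pi_\theta(a \mid s) = \exp\!\big(\big(Q_\theta(s, a) - V_\theta(s)\big) / \beta\big).
```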
In many LLM applications, reward signals are available only at the trajectory level (e.g., based on the final answer), making token-level credit assignment difficult. To address this, TBRM introduces a trajectory-level variant of Bellman Residual Minimization that optimizes a single squared residual per trajectory.
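As a sketch of how a single per-trajectory residual can arise (our derivation and notation, not necessarily the paper's exact objective): with logits as Q-values, $\beta \log \pi_\theta(a_t \mid s_t) = Q_\theta(s_t, a_t) - V_\theta(s_t)$, so the per-token residuals telescope when the reward $R(\tau)$ is granted only at the end and the terminal value is zero:

```latex
% Trajectory \tau = (s_1, a_1, \dots, s_T, a_T), terminal reward R(\tau),
% V_\theta(s_{T+1}) = 0. The per-token residuals telescope:
\sum_{t=1}^{T} \big( Q_\theta(s_t, a_t) - V_\theta(s_{t+1}) \big) - R(\tau)
  = V_\theta(s_1) + \beta \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) - R(\tau),

\qquad
\mathcal{L}(\theta) = \tfrac{1}{2}\Big( V_\theta(s_1) + \beta \log \pi_\theta(\tau) - R(\tau) \Big)^2 .
```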
On six mathematical-reasoning benchmarks (AIME24/25, AMC23, MATH500, Minerva-Math, and OlympiadBench), TBRM consistently outperforms PPO and GRPO baselines. Notably, TBRM achieves up to 30.5% accuracy on AIME24 with Qwen2.5-Math-7B. Compared to GRPO, it improves the average benchmark score by 1.3% absolute; under conditions comparable to PPO, it achieves better performance with 22.5% less training time and 33% lower GPU memory usage. We further show that TBRM benefits from additional rollouts yet already surpasses PPO with a single rollout per prompt, and that the trained model exhibits emergent reasoning patterns, such as verification, backtracking, and decomposition, that align with human mathematical practice.
Our implementation builds on the VERL framework. The core components of TBRM are organized as follows:
- `./verl/tbrm`: Main directory containing TBRM-specific logic.
  - `ray_trainer.py`: Coordinates the full training pipeline, including rollout generation, reward computation, and policy updates.
  - `dp_actor.py`: Implements the TBRM policy update step.
  - `core_algos.py`: Defines the TBRM loss function.
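To make the loss concrete, here is a minimal, self-contained sketch of a trajectory-level Bellman residual in the spirit of `core_algos.py`. All names, argument shapes, and the exact residual form are our assumptions for illustration, not the repo's actual implementation: it treats first-step logits as Q-values, derives the value by log-sum-exp, and squares the telescoped residual.

```python
import math

def trajectory_bellman_residual_loss(first_step_logits, chosen_token_logps,
                                     trajectory_reward, beta=1.0):
    """Hypothetical trajectory-level Bellman residual (illustrative only).

    first_step_logits: vocabulary logits at the first decoding step,
        interpreted as Q-values Q(s_1, .).
    chosen_token_logps: log pi(a_t | s_t) for each generated token t.
    trajectory_reward: scalar reward R(tau) granted at the end of the trajectory.
    """
    # V(s_1) = beta * logsumexp(Q(s_1, .) / beta), computed stably
    m = max(first_step_logits)
    v1 = m + beta * math.log(sum(math.exp((q - m) / beta)
                                 for q in first_step_logits))
    # Telescoped residual: V(s_1) + beta * log pi(tau) - R(tau)
    residual = v1 + beta * sum(chosen_token_logps) - trajectory_reward
    return 0.5 * residual ** 2
```

For a two-token vocabulary with zero logits and one uniformly sampled token, `V(s_1) = beta * log 2` exactly cancels `beta * log pi = -beta * log 2`, so a zero reward yields zero loss.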
- Create a new conda environment

  ```shell
  conda create -y -n value python=3.10
  conda activate value
  ```

- Install dependencies

  ```shell
  git clone https://github.com/rlx-lab/TBRM.git
  cd tbrm
  pip install -e .
  pip3 install vllm==0.8.3
  pip3 install flash-attn==2.7.4.post1 --no-build-isolation
  ```

- Log in to wandb and Hugging Face if needed

  ```shell
  wandb login [YOUR TOKEN]
  huggingface-cli login --token [YOUR TOKEN]
  ```

- Prepare the training and test datasets

  ```shell
  python examples/data_preprocess/tbrm_math.py \
      --hf_repo_id RyanYr/value-llm_DAPO-Math-17k \
      --split train \
      --system_prompt qwen-math \
      --append_instruction null \
      --output_path dapo_ds_train_sample.parquet

  python examples/data_preprocess/tbrm_math.py \
      --hf_repo_id RyanYr/MathEval \
      --split test \
      --output_path matheval.parquet \
      --filter "data_source:math_eval_aime24"
  ```

- Start training

  Set `SZ` to the model size you want to train, and `huggingface_repo` to the path of the Hugging Face repository where checkpoints will be saved. The following command reproduces our results for `Qwen2.5-Math-7B`:

  ```shell
  SZ=7 DATASET=dapo HF_ID=${huggingface_repo} bash examples/tbrm_trainer/run_qwen2.5math.sh
  ```

  Change `SZ` to `1.5` to train with `Qwen2.5-Math-1.5B`.
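The `--filter` argument used when preparing datasets selects rows by `column:value`, with `|` separating alternatives. As a rough illustration only — the actual parsing in `tbrm_math.py` may differ — such a filter string could be interpreted as:

```python
def parse_filter(expr):
    """Split "col:val|col:val" into (column, value) clauses (illustrative only)."""
    return [tuple(clause.split(":", 1)) for clause in expr.split("|")]

def row_matches(row, clauses):
    """A dataset row passes if it matches any clause (clauses are OR-ed)."""
    return any(row.get(col) == val for col, val in clauses)

# Example: keep only AIME24 and AMC23 rows
clauses = parse_filter("data_source:math_eval_aime24|data_source:math_eval_amc23")
```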
Suppose your model checkpoint is stored on Hugging Face under the path `model_hf_repo`. To run benchmark evaluations, start by downloading the checkpoint locally.
```shell
python scripts/hf_download.py \
    --hf_id "${model_hf_repo}" \
    --revision "${commit}" \
    --local_path "$model_local_dir" \
    --ignore_patterns "optim*"
```

To convert the checkpoint into a format compatible with vLLM, use the script `model_merger.py`. Set `base_model` to the path of the base model that includes the tokenizer, e.g., `base_model=Qwen/Qwen2.5-Math-7B`.
```shell
python scripts/model_merger.py \
    --backend fsdp \
    --target_dir "$model_local_dir/huggingface" \
    --hf_model_path "$base_model" \
    --local_dir "$model_local_dir"
```

Next, prepare the evaluation benchmarks.
```shell
python examples/data_preprocess/tbrm_math.py \
    --hf_repo_id RyanYr/MathEval \
    --split test \
    --output_path matheval_single.parquet \
    --system_prompt qwen-math \
    --append_instruction qwen-math \
    --shuffle \
    --filter "data_source:math_eval_math500|data_source:math_eval_olympiadbench|data_source:math_eval_minerva_math"
```
```shell
python examples/data_preprocess/tbrm_math.py \
    --hf_repo_id RyanYr/MathEval \
    --split test \
    --output_path matheval_multi.parquet \
    --system_prompt qwen-math \
    --append_instruction qwen-math \
    --shuffle \
    --filter "data_source:math_eval_aime24|data_source:math_eval_aime25|data_source:math_eval_amc23"
```

Finally, run `vllm_infer.py` to generate the responses.
```shell
model_path="$model_local_dir/huggingface"
benchmark="single"   # Either single or multi
K=1                  # Number of responses per question
temperature="0"      # Set to a non-zero value if K > 1
max_new_tokens=3072
output_hf_repo=...   # Set to your Hugging Face repo for storing the generated responses

python scripts/vllm_infer.py \
    --model_name_or_path ${model_path} \
    --revision "main" \
    --tokenizer_name_or_path ${base_model} \
    --dataset_name_or_path "matheval_${benchmark}.parquet" \
    --split "train" \
    --output_hf_dataset "${output_hf_repo}" \
    --output_split "output" \
    --K ${K} \
    --max_new_tokens ${max_new_tokens} \
    --seed 42 \
    --temperature "${temperature}" \
    --prompt_column "prompt" \
    --eval_result_output_path "${eval_output_path}" \
    --role main \
    --ngpus $NGPUS \
    --eval True
```

To sample multiple answers per prompt, set `K` to the number of answers, e.g.:
```shell
model_path="$model_local_dir/huggingface"
benchmark="multi"    # Either single or multi
K=32                 # Number of responses per question
temperature="1.0"    # Set to a non-zero value if K > 1
max_new_tokens=3072
output_hf_repo=...   # Set to your Hugging Face repo for storing the generated responses

python scripts/vllm_infer.py \
    --model_name_or_path ${model_path} \
    --revision "main" \
    --tokenizer_name_or_path ${base_model} \
    --dataset_name_or_path "matheval_${benchmark}.parquet" \
    --split "train" \
    --output_hf_dataset "${output_hf_repo}" \
    --output_split "output" \
    --K ${K} \
    --max_new_tokens ${max_new_tokens} \
    --seed 42 \
    --temperature "${temperature}" \
    --prompt_column "prompt" \
    --eval_result_output_path "${eval_output_path}" \
    --role main \
    --ngpus $NGPUS \
    --eval True
```

We sincerely thank VERL for providing the excellent codebase.


