Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), a simple yet effective off-policy algorithm that adapts this idea to LLMs by optimizing a single trajectory-level Bellman objective using the model's own logits as Q-values.
We propose Trajectory Bellman Residual Minimization (TBRM), a simple and effective value-based reinforcement learning (RL) algorithm tailored for large language models (LLMs). TBRM directly leverages the model logits to parameterize the Q-values, value function, and policy under a KL-regularized RL framework.
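Concretely, in the standard KL-regularized (soft) RL setup, the logits act as Q-values and the value function and policy follow in closed form. The notation below is ours, sketching the usual soft-RL identities rather than quoting the paper:

```latex
% Logits z_\theta as Q-values; V and \pi follow in closed form,
% with \beta the KL-regularization coefficient.
Q_\theta(s, a) = z_\theta(s, a), \qquad
V_\theta(s) = \beta \log \sum_{a'} \exp\!\big(Q_\theta(s, a') / \beta\big), \qquad
\pi_\theta(a \mid s) = \exp\!\big(\big(Q_\theta(s, a) - V_\theta(s)\big) / \beta\big).
```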
In many LLM applications, reward signals are available only at the trajectory level (e.g., based on the final answer), making token-level credit assignment difficult. To address this, TBRM introduces a trajectory-level variant of Bellman Residual Minimization that optimizes a single squared residual per trajectory.
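As a sketch of how a single per-trajectory residual can arise (our derivation and notation, not necessarily the paper's exact objective): with logits as Q-values, $\beta \log \pi_\theta(a_t \mid s_t) = Q_\theta(s_t, a_t) - V_\theta(s_t)$, so the per-token residuals telescope when the reward $R(\tau)$ is granted only at the end and the terminal value is zero:

```latex
% Trajectory \tau = (s_1, a_1, \dots, s_T, a_T), terminal reward R(\tau),
% V_\theta(s_{T+1}) = 0. The per-token residuals telescope:
\sum_{t=1}^{T} \big( Q_\theta(s_t, a_t) - V_\theta(s_{t+1}) \big) - R(\tau)
  = V_\theta(s_1) + \beta \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t) - R(\tau),

\qquad
\mathcal{L}(\theta) = \tfrac{1}{2}\Big( V_\theta(s_1) + \beta \log \pi_\theta(\tau) - R(\tau) \Big)^2 .
```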
On six mathematical-reasoning benchmarks (AIME24/25, AMC23, MATH500, Minerva-Math, and OlympiadBench), TBRM consistently outperforms PPO and GRPO baselines. Notably, TBRM achieves up to 30.5% accuracy on AIME24 with Qwen2.5-Math-7B. Compared to GRPO, it improves the average benchmark score by 1.3% absolute; under conditions comparable to PPO, it achieves better performance with 22.5% less training time and 33% lower GPU memory usage. We further show that TBRM benefits from additional rollouts yet already surpasses PPO with a single rollout per prompt, and that the trained model exhibits emergent reasoning patterns, such as verification, backtracking, and decomposition, that align with human mathematical practice.
Our implementation builds on the VERL framework. The core components of TBRM are organized as follows:
- `./verl/tbrm`: Main directory containing TBRM-specific logic.
  - `ray_trainer.py`: Coordinates the full training pipeline, including rollout generation, reward computation, and policy updates.
  - `dp_actor.py`: Implements the TBRM policy update step.
  - `core_algos.py`: Defines the TBRM loss function.
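To make the loss concrete, here is a minimal, self-contained sketch of a trajectory-level Bellman residual in the spirit of `core_algos.py`. All names, argument shapes, and the exact residual form are our assumptions for illustration, not the repo's actual implementation: it treats first-step logits as Q-values, derives the value by log-sum-exp, and squares the telescoped residual.

```python
import math

def trajectory_bellman_residual_loss(first_step_logits, chosen_token_logps,
                                     trajectory_reward, beta=1.0):
    """Hypothetical trajectory-level Bellman residual (illustrative only).

    first_step_logits: vocabulary logits at the first decoding step,
        interpreted as Q-values Q(s_1, .).
    chosen_token_logps: log pi(a_t | s_t) for each generated token t.
    trajectory_reward: scalar reward R(tau) granted at the end of the trajectory.
    """
    # V(s_1) = beta * logsumexp(Q(s_1, .) / beta), computed stably
    m = max(first_step_logits)
    v1 = m + beta * math.log(sum(math.exp((q - m) / beta)
                                 for q in first_step_logits))
    # Telescoped residual: V(s_1) + beta * log pi(tau) - R(tau)
    residual = v1 + beta * sum(chosen_token_logps) - trajectory_reward
    return 0.5 * residual ** 2
```

For a two-token vocabulary with zero logits and one uniformly sampled token, `V(s_1) = beta * log 2` exactly cancels `beta * log pi = -beta * log 2`, so a zero reward yields zero loss.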
- Create a new conda environment

  ```shell
  conda create -y -n value python=3.10
  conda activate value
  ```

- Install dependencies

  ```shell
  git clone https://github.com/rlx-lab/TBRM.git
  cd tbrm
  pip install -e .
  pip3 install vllm==0.8.3
  pip3 install flash-attn==2.7.4.post1 --no-build-isolation
  ```

- Log in to wandb and Hugging Face if needed

  ```shell
  wandb login [YOUR TOKEN]
  huggingface-cli login --token [YOUR TOKEN]
  ```

- Prepare the training and test datasets

  ```shell
  python examples/data_preprocess/tbrm_math.py \
      --hf_repo_id RyanYr/value-llm_DAPO-Math-17k \
      --split train \
      --system_prompt qwen-math \
      --append_instruction null \
      --output_path dapo_ds_train_sample.parquet

  python examples/data_preprocess/tbrm_math.py \
      --hf_repo_id RyanYr/MathEval \
      --split test \
      --output_path matheval.parquet \
      --filter "data_source:math_eval_aime24"
  ```

- Start training

  Set `SZ` to the model size you want to train, and `huggingface_repo` to the path of the Hugging Face repository where checkpoints will be saved. The following command reproduces our results for `Qwen2.5-Math-7B`:

  ```shell
  SZ=7 DATASET=dapo HF_ID=${huggingface_repo} bash examples/tbrm_trainer/run_qwen2.5math.sh
  ```

  Change `SZ` to `1.5` to train with `Qwen2.5-Math-1.5B`.
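The `--filter` argument used when preparing datasets selects rows by `column:value`, with `|` separating alternatives. As a rough illustration only — the actual parsing in `tbrm_math.py` may differ — such a filter string could be interpreted as:

```python
def parse_filter(expr):
    """Split "col:val|col:val" into (column, value) clauses (illustrative only)."""
    return [tuple(clause.split(":", 1)) for clause in expr.split("|")]

def row_matches(row, clauses):
    """A dataset row passes if it matches any clause (clauses are OR-ed)."""
    return any(row.get(col) == val for col, val in clauses)

# Example: keep only AIME24 and AMC23 rows
clauses = parse_filter("data_source:math_eval_aime24|data_source:math_eval_amc23")
```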
Suppose your model checkpoint is stored on Hugging Face under the path `model_hf_repo`. To run benchmark evaluations, start by downloading the checkpoint locally.
```shell
python scripts/hf_download.py \
    --hf_id "${model_hf_repo}" \
    --revision "${commit}" \
    --local_path "$model_local_dir" \
    --ignore_patterns "optim*"
```

To convert the checkpoint into a format compatible with vLLM, use the script `model_merger.py`. Set `base_model` to the path of the base model that includes the tokenizer, e.g., `base_model=Qwen/Qwen2.5-Math-7B`.
```shell
python scripts/model_merger.py \
    --backend fsdp \
    --target_dir "$model_local_dir/huggingface" \
    --hf_model_path "$base_model" \
    --local_dir "$model_local_dir"
```

Next, prepare the evaluation benchmarks.
```shell
python examples/data_preprocess/tbrm_math.py \
    --hf_repo_id RyanYr/MathEval \
    --split test \
    --output_path matheval_single.parquet \
    --system_prompt qwen-math \
    --append_instruction qwen-math \
    --shuffle \
    --filter "data_source:math_eval_math500|data_source:math_eval_olympiadbench|data_source:math_eval_minerva_math"
```
```shell
python examples/data_preprocess/tbrm_math.py \
    --hf_repo_id RyanYr/MathEval \
    --split test \
    --output_path matheval_multi.parquet \
    --system_prompt qwen-math \
    --append_instruction qwen-math \
    --shuffle \
    --filter "data_source:math_eval_aime24|data_source:math_eval_aime25|data_source:math_eval_amc23"
```

Finally, run `vllm_infer.py` to generate the responses.
```shell
model_path="$model_local_dir/huggingface"
benchmark="single"   # Either single or multi
K=1                  # Number of responses per question
temperature="0"      # Set to a non-zero value if K > 1
max_new_tokens=3072
output_hf_repo=...   # Set to your Hugging Face repo for storing the generated responses

python scripts/vllm_infer.py \
    --model_name_or_path ${model_path} \
    --revision "main" \
    --tokenizer_name_or_path ${base_model} \
    --dataset_name_or_path "matheval_${benchmark}.parquet" \
    --split "train" \
    --output_hf_dataset "${output_hf_repo}" \
    --output_split "output" \
    --K ${K} \
    --max_new_tokens ${max_new_tokens} \
    --seed 42 \
    --temperature "${temperature}" \
    --prompt_column "prompt" \
    --eval_result_output_path "${eval_output_path}" \
    --role main \
    --ngpus $NGPUS \
    --eval True
```

To sample multiple answers per prompt, set `K` to the number of answers, e.g.:
```shell
model_path="$model_local_dir/huggingface"
benchmark="multi"    # Either single or multi
K=32                 # Number of responses per question
temperature="1.0"    # Set to a non-zero value if K > 1
max_new_tokens=3072
output_hf_repo=...   # Set to your Hugging Face repo for storing the generated responses

python scripts/vllm_infer.py \
    --model_name_or_path ${model_path} \
    --revision "main" \
    --tokenizer_name_or_path ${base_model} \
    --dataset_name_or_path "matheval_${benchmark}.parquet" \
    --split "train" \
    --output_hf_dataset "${output_hf_repo}" \
    --output_split "output" \
    --K ${K} \
    --max_new_tokens ${max_new_tokens} \
    --seed 42 \
    --temperature "${temperature}" \
    --prompt_column "prompt" \
    --eval_result_output_path "${eval_output_path}" \
    --role main \
    --ngpus $NGPUS \
    --eval True
```

We sincerely thank VERL for providing the excellent codebase.


