
rlx-lab/TBRM


Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning


Introduction

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model's own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines such as PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL may be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

Method

We propose Trajectory Bellman Residual Minimization (TBRM), a simple and effective value-based reinforcement learning (RL) algorithm tailored for large language models (LLMs). TBRM directly leverages the model logits to parameterize the Q-values, value function, and policy under a KL-regularized RL framework:
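As a sketch of this parameterization (the notation below is assumed for illustration, not copied from the paper): with per-step logits $z_\theta(\cdot \mid s)$ and a KL-regularization temperature $\beta$, one natural choice is

```latex
Q_\theta(s, a) \;=\; \beta\, z_\theta(a \mid s), \qquad
V_\theta(s) \;=\; \beta \log \sum_{a} \exp\!\big(z_\theta(a \mid s)\big), \qquad
\pi_\theta(a \mid s) \;=\; \mathrm{softmax}\big(z_\theta(\cdot \mid s)\big)_a ,
```

so that the policy, soft value function, and Q-values are all read off the same logits without a separate critic. Refer to the paper for the exact definitions.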

In many LLM applications, reward signals are only available at the trajectory level (e.g., based on final answers), making token-level credit assignment difficult. To address this, TBRM introduces a trajectory-level variant of Bellman Residual Minimization, which optimizes a single squared residual per trajectory:
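One plausible form of this objective, inferred from the description above (the exact formula is in the paper; $\bar\theta$ denoting a stop-gradient copy and terminal values taken to be zero are assumptions):

```latex
\mathcal{L}(\theta; \tau) \;=\;
\Big( \sum_{h=1}^{H} \big[\, Q_\theta(s_h, a_h) - V_{\bar\theta}(s_{h+1}) \,\big] \;-\; r(\tau) \Big)^{2},
```

where the per-token Bellman residuals are summed along the trajectory $\tau = (s_1, a_1, \dots, s_H, a_H)$ and matched against the single trajectory-level reward $r(\tau)$, sidestepping token-level credit assignment.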

On six mathematical-reasoning benchmarks (AIME24/25, AMC23, MATH500, Minerva-Math, and OlympiadBench), TBRM consistently outperforms PPO and GRPO baselines. Notably, TBRM achieves up to 30.5% accuracy on AIME24 with Qwen2.5-Math-7B. Compared to GRPO, it improves the average benchmark score by 1.3% absolute, while under comparable conditions to PPO, it achieves better performance with 22.5% less training time and 33% lower GPU memory. We further demonstrate that TBRM benefits from additional rollouts but already surpasses PPO with a single rollout per prompt. Moreover, the model learns emergent reasoning patterns, such as verification, backtracking, and decomposition, that align with human mathematical practice.

Codebase Structure

Our implementation builds on the VERL framework. The core components of TBRM are organized as follows:

  • ./verl/tbrm: Main directory containing TBRM-specific logic.

      • ray_trainer.py: Coordinates the full training pipeline, including rollout generation, reward computation, and policy updates.

      • dp_actor.py: Implements the TBRM policy update step.

      • core_algos.py: Defines the TBRM loss function.
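As a self-contained illustration of what a trajectory-level Bellman residual loss can look like (this is a hedged sketch, not the code in core_algos.py; the function name, the stop-gradient handling, and the zero terminal value are assumptions based on the description above):

```python
import numpy as np

def logsumexp(z, axis=-1):
    """Numerically stable log-sum-exp along the given axis."""
    m = z.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(z - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def tbrm_loss(logits, actions, reward, beta=1.0):
    """Trajectory-level squared Bellman residual (illustrative sketch).

    logits  : (H, V) array of per-step logits over the vocabulary
    actions : (H,) array of token ids actually taken
    reward  : scalar trajectory-level reward r(tau)
    beta    : KL-regularization temperature
    """
    H = len(actions)
    q = beta * logits[np.arange(H), actions]    # Q(s_h, a_h) read off the logits
    v = beta * logsumexp(logits, axis=-1)       # soft value V(s_h) per step
    v_next = np.append(v[1:], 0.0)              # V(s_{h+1}); terminal value = 0
    residual = (q - v_next).sum() - reward      # telescoped trajectory residual
    return residual ** 2
```

With uniform logits (all zeros over a vocabulary of 3) and zero reward, the residual reduces to −log 3 from the first step, so the loss is (log 3)², which makes the telescoping structure easy to check by hand.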

Environment Setup

  1. Create a new conda environment

    conda create -y -n value python=3.10
    conda activate value
  2. Install dependencies

    git clone https://github.com/rlx-lab/TBRM.git
    cd TBRM
    pip install -e .
    pip3 install vllm==0.8.3
    pip3 install flash-attn==2.7.4.post1 --no-build-isolation
  3. Log in to wandb and huggingface if needed

    wandb login [YOUR TOKEN]
    huggingface-cli login --token [YOUR TOKEN]

Running Experiments

  1. Prepare the training and test datasets

    python examples/data_preprocess/tbrm_math.py \
        --hf_repo_id RyanYr/value-llm_DAPO-Math-17k \
        --split train \
        --system_prompt qwen-math \
        --append_instruction null \
        --output_path dapo_ds_train_sample.parquet
    
    python examples/data_preprocess/tbrm_math.py \
        --hf_repo_id RyanYr/MathEval \
        --split test \
        --output_path matheval.parquet \
        --filter "data_source:math_eval_aime24"
  2. Start training

    Set SZ to the model size you want to train, and huggingface_repo to the path of the Hugging Face repository where checkpoints will be saved. The following command will reproduce our results for Qwen2.5-Math-7B:

    SZ=7 DATASET=dapo HF_ID=${huggingface_repo} bash examples/tbrm_trainer/run_qwen2.5math.sh

    Change SZ to 1.5 to train with Qwen2.5-Math-1.5B.

Evaluation

Suppose your model checkpoint is stored on Hugging Face under the path model_hf_repo. To run benchmark evaluations, first download the checkpoint locally:

python scripts/hf_download.py \
    --hf_id "${model_hf_repo}" \
    --revision "${commit}" \
    --local_path "$model_local_dir" \
    --ignore_patterns optim*

To convert the checkpoint into a format compatible with vLLM, use the script model_merger.py. Set base_model to the path of the base model that includes the tokenizer, e.g., base_model=Qwen/Qwen2.5-Math-7B.

python scripts/model_merger.py \
    --backend fsdp \
    --target_dir "$model_local_dir/huggingface" \
    --hf_model_path $base_model \
    --local_dir "$model_local_dir"

Next, prepare the evaluation benchmarks:

python examples/data_preprocess/tbrm_math.py \
    --hf_repo_id RyanYr/MathEval \
    --split test \
    --output_path matheval_single.parquet \
    --system_prompt qwen-math \
    --append_instruction qwen-math \
    --shuffle \
    --filter "data_source:math_eval_math500|data_source:math_eval_olympiadbench|data_source:math_eval_minerva_math"

python examples/data_preprocess/tbrm_math.py \
    --hf_repo_id RyanYr/MathEval \
    --split test \
    --output_path matheval_multi.parquet \
    --system_prompt qwen-math \
    --append_instruction qwen-math \
    --shuffle \
    --filter "data_source:math_eval_aime24|data_source:math_eval_aime25|data_source:math_eval_amc23"

Finally, run vllm_infer.py to generate responses.

model_path="$model_local_dir/huggingface"
benchmark="single" # Either single or multi
K=1 # number of responses per question
temperature="0" # Set to non-zero value if K > 1
max_new_tokens=3072
output_hf_repo=... # Set to your huggingface repo to store the generated responses
python scripts/vllm_infer.py \
    --model_name_or_path ${model_path} \
    --revision "main" \
    --tokenizer_name_or_path ${base_model} \
    --dataset_name_or_path "matheval_${benchmark}.parquet" \
    --split "train" \
    --output_hf_dataset "${output_hf_repo}" \
    --output_split "output" \
    --K ${K} \
    --max_new_tokens ${max_new_tokens} \
    --seed 42 \
    --temperature "${temperature}" \
    --prompt_column "prompt" \
    --eval_result_output_path "${eval_output_path}" \
    --role main \
    --ngpus $NGPUS \
    --eval True

To sample multiple answers per prompt, set K to the number of answers, e.g.,

model_path="$model_local_dir/huggingface"
benchmark="multi" # Either single or multi
K=32 # number of responses per question
temperature="1.0" # Set to non-zero value if K > 1
max_new_tokens=3072
output_hf_repo=... # Set to your huggingface repo to store the generated responses
python scripts/vllm_infer.py \
    --model_name_or_path ${model_path} \
    --revision "main" \
    --tokenizer_name_or_path ${base_model} \
    --dataset_name_or_path "matheval_${benchmark}.parquet" \
    --split "train" \
    --output_hf_dataset "${output_hf_repo}" \
    --output_split "output" \
    --K ${K} \
    --max_new_tokens ${max_new_tokens} \
    --seed 42 \
    --temperature "${temperature}" \
    --prompt_column "prompt" \
    --eval_result_output_path "${eval_output_path}" \
    --role main \
    --ngpus $NGPUS \
    --eval True

Acknowledgement

We sincerely thank the VERL team for providing an excellent codebase.
