🔥 Our new work, Life-Harness, is open-sourced: https://github.com/Tianshi-Xu/Life-Harness. It is a training-free, evolving harness with performance improvements of up to 1468%!
- We propose CLEANER, which resolves the credit assignment dilemma in agentic RL by training on self-purified trajectories. This enables models to internalize correct reasoning patterns while filtering out execution noise.
- We introduce the SAAR mechanism to autonomously construct these clean signals. SAAR adaptively repairs failures—ranging from minor syntax typos to deep logical flaws—without the computational overhead of supersampling.
- We show state-of-the-art efficiency and performance. CLEANER improves accuracy by 6% on AIME and 5% on LiveCodeBench, and matches SOTA performance using only one-third of the training steps.
- We provide a fully reproducible pipeline and release code, environment configs, and processed datasets via GitHub to support further research.
This project is organized and extended based on DemyAgent. The training datasets are also sourced from that project; if you use them, please cite it as well (see the citation at the end).
CLEANER is trained with the VeRL framework. We recommend using the latest VeRL to ensure the best performance (e.g., keep a fork updated). A Docker-based setup is recommended; a working example is shown below. As VeRL evolves, image tags may change, but the following image is compatible with the latest version at the time of writing:
docker pull verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2
docker create --runtime=nvidia --gpus all --net=host --shm-size="10g" --cap-add=SYS_ADMIN --name verl IMAGE_ID sleep infinity
docker start verl
docker exec -it verl bash

Other official VeRL installation options are supported; see the VeRL documentation.
Inside the Docker container, clone this repository and install dependencies:
sudo apt-get update -y
sudo apt install tmux -y
sudo apt-get install -y libglib2.0-0
sudo apt-get install redis -y
cd Open-CLEANER
pip install -r code-judge/requirements.txt
pip install -e code-judge
cd verl && pip install -e .[sglang] && cd ..

The commands above install the required dependencies: code-judge and VeRL.
The datasets used by CLEANER come from the DemyAgent project. We provide the original dataset links here:
Recommended dataset layout:

    /Open-CLEANER
        /dataset
            /Open-AgentRL-30K
            /Open-AgentRL-Eval

Note: For the Qwen2.5-7B model, we recommend filtering the 30K dataset to remove questions that are too easy or too difficult. The file dataset/Open-AgentRL-30K/Open-AgentRL-30K.filtered.parquet.success.jsonl records pass rates from 8 rollouts of the Qwen2.5-7B-RA-SFT model. You can filter the dataset using:
bash data_filter.sh

You can tune success_rate_ratios to control retention ratios across difficulty levels. The default configuration reflects our best empirical results. The script generates a new dataset file at dataset/Open-AgentRL-30K/filtered_sr.parquet, which can be used directly to train CLEANER-7B.
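The filtering idea can be sketched in a few lines of Python. This is only an illustration, not the logic of `data_filter.sh`: the record keys (`id`, `success_rate`), the `(low, high, keep_ratio)` bucket format, and the function name are all assumptions made for the example, and the real structure of `success_rate_ratios` may differ.

```python
import random


def filter_by_success_rate(records, buckets, seed=0):
    """Keep each record with the retention ratio of its pass-rate bucket.

    records: list of dicts with hypothetical keys "id" and "success_rate" (0.0 to 1.0).
    buckets: list of (low, high, keep_ratio) tuples (an assumed format); a record
             whose pass rate lies in [low, high] is kept with probability keep_ratio.
    Records that match no bucket are dropped.
    """
    rng = random.Random(seed)
    kept = []
    for record in records:
        rate = record["success_rate"]
        for low, high, keep_ratio in buckets:
            if low <= rate <= high:
                if rng.random() < keep_ratio:
                    kept.append(record)
                break  # the first matching bucket decides
    return kept


# Pass rates from 8 rollouts take the values 0/8, 1/8, ..., 8/8.
records = [{"id": k, "success_rate": k / 8} for k in range(9)]
# Drop questions that are never solved or always solved, keep everything else.
buckets = [(0.0, 0.0, 0.0), (0.125, 0.875, 1.0), (1.0, 1.0, 0.0)]
kept = filter_by_success_rate(records, buckets)
kept_ids = [r["id"] for r in kept]
```

Ratios strictly between 0 and 1 would subsample a difficulty band rather than keep or drop it entirely, which is one plausible reading of "retention ratios across difficulty levels".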
Since CLEANER primarily improves the RL stage, we recommend downloading the Cold-Start SFT checkpoints provided by DemyAgent to save time:
| Model | Link |
|---|---|
| Qwen2.5-7B-RA-SFT | 🤗 HuggingFace |
| Qwen3-4B-RA-SFT | 🤗 HuggingFace |
After downloading the datasets and SFT checkpoints, follow the steps below to start RL training.
- Start the `code-judge` service (used to execute tool-call code generated by the model):
  - Configure `code_judge_url` in `verl/utils/reward_score/livecodebench/code_math.py` (it is already set by default).
  - Run the following commands to start the service:

        redis-server --daemonize yes --protected-mode no --bind 0.0.0.0
        redis-cli ping
        tmux new-session -d -s server 'cd ./code-judge && MAX_EXECUTION_TIME=4 REDIS_URI="redis://localhost:6379" RUN_WORKERS=0 uvicorn app.main:app --host 0.0.0.0 --port 8088 --workers 16 2>&1 | tee server.log'
        tmux new-session -d -s worker 'cd ./code-judge && MAX_EXECUTION_TIME=4 REDIS_URI="redis://localhost:6379" MAX_WORKERS=64 python run_workers.py 2>&1 | tee worker.log'

  - Verify the service:

        curl -X POST http://0.0.0.0:8088/judge -H "Content-Type: application/json" -d '{"type":"python","solution":"print(\"hello world\")","expected_output":"hello world\n"}'

- The `recipe/cleaner` folder contains scripts for training CLEANER and baselines (including baseline-notool). Run `bash recipe/cleaner/qwen3_4b_cleaner.sh` to train CLEANER-4B; other scripts are similar. Key parameters to configure:
  - `train_dataset`: RL dataset path, e.g., `dataset/Open-AgentRL-30K/Open-AgentRL-30K.parquet`. Other dataset/model paths follow the same pattern.
  - `WANDB_API_KEY`: your Weights & Biases API key for logging. You can also log in via the CLI to avoid setting this in the script.
  - `OUTPUT_DIR`: output path for checkpoints and logs.
  - `num_GPU`: number of GPUs to use.
  - `resume_dir`: resume path; add `trainer.resume_mode=resume_path` and `trainer.resume_from_path=$resume_dir` when resuming.
- CLEANER-specific parameters that control the SAAR mechanism:
  - `+actor_rollout_ref.rollout.multi_turn.enable_tool_rollback`: set `True` to enable SAAR. Requires `+actor_rollout_ref.rollout.multi_turn.max_tool_retries` (we recommend `3`).
  - SAAR similarity threshold $\gamma$: in `verl/verl/experimental/agent_loop/tool_agent_loop.py`, replace `0.5` in `should_replace_reasoning = (similarity < 0.5)` with your desired threshold.
  - `+actor_rollout_ref.rollout.multi_turn.rollback_probability`: rollback probability. We recommend `0.7` for Qwen2.5-7B and `1.0` for Qwen3-4B.
  - `actor_rollout_ref.actor.fsdp_config.dtype` and `actor_rollout_ref.rollout.dtype`: we recommend setting both to `float16` to reduce train/infer mismatch.
  - `+algorithm.rollout_correction.rollout_is`: enables IS correction for train/infer mismatch. CLEANER should disable this; otherwise it introduces logits recomputation for overwritten segments. Baselines can enable it, though we found limited gains.
  - `save_negative_samples`, `use_dpo_on_tool_calls`, `dpo_beta`, and `dpo_max_adjustment_ratio`: parameters for negative-sample DPO (a failed attempt; see the appendix). Keep them disabled by default.
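To make the interaction between the rollback probability and the similarity threshold $\gamma$ concrete, here is a minimal sketch. It is not the code in `tool_agent_loop.py`: the function name `plan_rollback`, the returned labels, the choice of comparing the failed and repaired tool-call code, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration. Only the comparison `similarity < gamma` mirrors the expression quoted above.

```python
import difflib
import random


def plan_rollback(failed_code, repaired_code, gamma=0.5,
                  rollback_probability=1.0, rng=None):
    """Decide how a failed tool call might be overwritten (illustrative only)."""
    rng = rng or random.Random(0)
    # Assumed reading: roll back with the given probability, otherwise keep the failure.
    if rng.random() >= rollback_probability:
        return "keep_failed_turn"
    # Stand-in similarity in [0, 1]; the repository may compute it differently.
    similarity = difflib.SequenceMatcher(None, failed_code, repaired_code).ratio()
    should_replace_reasoning = similarity < gamma
    if should_replace_reasoning:
        return "replace_full_turn"       # the repair differs a lot: rewrite the reasoning too
    return "replace_tool_call_only"      # small fix: keep the original reasoning prefix


minor_fix = plan_rollback("print(x", "print(x)")
deep_fix = plan_rollback("aaaa", "zzzz")
skipped = plan_rollback("aaaa", "zzzz", rollback_probability=0.0)
```

Under these assumptions, a higher $\gamma$ makes full-turn replacement more frequent, while a lower value preserves the original reasoning more often.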
For training baselines like DAPO-baseline, you can directly use the scripts:
- `recipe/cleaner/qwen3_4b.sh`
- `recipe/cleaner/qwen2_7b.sh`

These correspond to the DAPO-baseline configurations.
If you want to train DemyAgent-4B, you can start from the same scripts and simply:
- reduce the learning rate to 1e-6
- increase the number of training epochs by 3×
| Model | Link |
|---|---|
| CLEANER-4B | 🤗 HuggingFace |
Evaluate AIME and GPQA:
bash recipe/cleaner/eval/eval_qwen3_4b_aime_gpqa.sh

For CLEANER evaluation, enable +actor_rollout_ref.rollout.multi_turn.enable_tool_rollback=True and +actor_rollout_ref.rollout.multi_turn.max_tool_retries=3 by default.
Evaluate LiveCodeBench:
bash recipe/cleaner/eval/eval_qwen3_4b_livecodebench.sh

This generates trajectories under VAL_SAVE_PATH. Follow the Custom Evaluation section in the LiveCodeBench repository. You will need to align the generated trajectories with their question_id and convert them to the LiveCodeBench format. The question_id is in dataset/Open-AgentRL-Eval/livecodebench-v6/lcb_v6_2502_2505.parquet.
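The alignment step might look roughly like the sketch below. The trajectory layout is an assumption: each trajectory is taken to be a dict with a hypothetical `index` (its row position in the evaluation parquet) and a `response` string, and the final code block in the response is treated as the submitted solution. The output shape, a list of objects with `question_id` and `code_list`, follows our understanding of the LiveCodeBench custom evaluation input; please confirm it against the LiveCodeBench repository before relying on it.

```python
import json
import re
from collections import defaultdict

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable
CODE_BLOCK = re.compile(FENCE + r"(?:python)?\n(.*?)" + FENCE, re.DOTALL)


def last_code_block(text):
    """Return the last fenced code block in a response, or "" if there is none."""
    blocks = CODE_BLOCK.findall(text)
    return blocks[-1].strip() if blocks else ""


def to_lcb_format(trajectories, question_ids):
    """Group extracted solutions by question_id.

    trajectories: list of dicts with hypothetical keys "index" and "response".
    question_ids: list of ids in dataset order, e.g. a column read from the parquet file.
    """
    grouped = defaultdict(list)
    for traj in trajectories:
        qid = question_ids[traj["index"]]
        grouped[qid].append(last_code_block(traj["response"]))
    return [{"question_id": qid, "code_list": codes} for qid, codes in grouped.items()]


question_ids = ["q_a", "q_b"]
trajectories = [
    {"index": 0, "response": "reasoning\n" + FENCE + "python\nprint(1)\n" + FENCE},
    {"index": 1, "response": "no code here"},
    {"index": 0, "response": FENCE + "python\nprint(2)\n" + FENCE},
]
converted = to_lcb_format(trajectories, question_ids)
serialized = json.dumps(converted, indent=2)  # write this string to a file for the evaluator
```

Multiple rollouts of the same question end up in one `code_list`, and a response without any code block contributes an empty string so that the sample count per question stays consistent.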
The core implementation lives in verl/verl/experimental/agent_loop/tool_agent_loop.py (all changes are marked with #CLEANER). Key flow:
- Entry & config: `ToolAgentLoop.__init__()` wires rollback parameters; `RollbackManager` controls enablement, max retries, and error-pattern matching.
- Trigger & detection: `_handle_processing_tools_state()` executes tool calls, using `_detect_errors()` / `_detect_error_from_text()` to classify failures from tool responses.
- Rollback core: `_handle_rollback()` appends error feedback, regenerates tool calls, and uses `_overwrite_last_assistant_turn()` to replace either only the tool-call segment or the full assistant turn.
- Token-level replacement: `_split_tool_call_segment()` and `_find_tool_call_token_boundary()` locate tool-call boundaries at the token level to preserve reasoning prefixes when possible.
- Stats & samples: `AgentData` tracks retries, tool-call stats, and optional negative samples; `_call_tool()` records first-attempt success/failure reasons.

Start from ToolAgentLoop._handle_rollback() and follow _handle_processing_tools_state() → _overwrite_last_assistant_turn() → _split_tool_call_segment() to understand the full rollback path.
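As a reading aid, the token-level replacement can be pictured with the toy sketch below. It is a simplification, not the repository code: the helper names are hypothetical, tokens are plain strings rather than token IDs, and the `<tool_call>` marker is an assumed delimiter for where a tool call begins.

```python
def find_tool_call_boundary(tokens, marker="<tool_call>"):
    """Return the index of the last tool-call start marker, or None if absent."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] == marker:
            return i
    return None


def overwrite_turn(original, repaired, replace_reasoning):
    """Splice a repaired turn into the original one.

    If replace_reasoning is False and both turns contain a tool call, the
    original reasoning prefix is kept and only the tool-call segment is swapped.
    Otherwise the whole turn is replaced by the repaired one.
    """
    old_boundary = find_tool_call_boundary(original)
    new_boundary = find_tool_call_boundary(repaired)
    if replace_reasoning or old_boundary is None or new_boundary is None:
        return list(repaired)
    return original[:old_boundary] + repaired[new_boundary:]


original = ["Let", "me", "compute", "<tool_call>", "prnt(1)", "</tool_call>"]
repaired = ["Fix", "typo", "<tool_call>", "print(1)", "</tool_call>"]
segment_only = overwrite_turn(original, repaired, replace_reasoning=False)
full_turn = overwrite_turn(original, repaired, replace_reasoning=True)
```

In the segment-only case the training trajectory keeps the original reasoning but carries the corrected tool call, which matches the idea of preserving reasoning prefixes when possible.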
If you use Open-CLEANER, please cite our paper:
@misc{xu2026cleanerselfpurifiedtrajectoriesboost,
title={CLEANER: Self-Purified Trajectories Boost Agentic Reinforcement Learning},
author={Tianshi Xu and Yuteng Chen and Meng Li},
year={2026},
eprint={2601.15141},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2601.15141}
}

Note: Please also cite the original DemyAgent paper if you use datasets or code from that project.
