⚠️ Important Notice: This repository is a reproduction of the ASTER work. We follow the details described in the original paper, but cannot guarantee complete consistency with the original implementation. A potentially significant difference is the training hardware/platform — we reproduce on NVIDIA GPUs, while the original work was trained on Huawei Ascend 910.
- [2026.02.13] We open-source our full reproduction of ASTER, including:
  - Dataset: Aster_SFT4K (cold-start SFT data; see the download instructions below)
  - Model: ASTER_4B
  - Training code for both SFT and RL stages
  - Evaluation scripts
ASTER (Agentic Scaling with Tool-integrated Extended Reasoning) is a two-stage framework that combines targeted cold-start supervised fine-tuning with reinforcement learning to scale Tool-Integrated Reasoning (TIR) capabilities in Large Language Models.
Traditional approaches to Tool-Integrated Reasoning often suffer from interaction collapse during RL training: a pathological failure mode in which the model stops sustaining multi-turn tool usage and degenerates into extensive internal reasoning followed by trivial, post-hoc code verification instead of genuine iterative planning.
ASTER addresses this challenge through an interaction-dense cold-start strategy:
- Interaction Density: Prioritizing long trajectories with multiple tool invocations
- Behavioral Prior: Establishing multi-turn tool-use behavioral patterns through cold-start SFT
- Multi-stage RL: Employing a two-stage curriculum learning strategy that progressively increases context length from 18K to 32K tokens
- Two-Stage Training Pipeline:
  - Stage 1: Cold-start SFT with interaction-dense trajectories (4K samples, 9+ tool-interaction turns each)
  - Stage 2: Multi-stage RL using Group Relative Policy Optimization (GRPO) with progressive context-length scaling
- Interaction-Dense Design: Long trajectories with frequent tool invocations to prevent interaction collapse
- Stateful Tool Environment: Python sandbox that persists execution state across turns
- Outcome-Only RL: Binary reward based on final-answer correctness, enabling efficient learning from sparse feedback (see the sketch after this list)
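To make these pieces concrete, here is a minimal, self-contained sketch of the three mechanisms above: a stateful sandbox that keeps one Python namespace alive across tool calls, a binary outcome reward, and GRPO's group-relative advantage. This is an illustration of the ideas, not the repository's actual implementation; all names (`StatefulSandbox`, `_result`, etc.) are hypothetical.

```python
import math

class StatefulSandbox:
    """Toy stand-in for the stateful tool environment: variables defined in
    one tool call stay visible in later calls within the same trajectory."""

    def __init__(self):
        self.namespace = {}  # shared state, persists across invocations

    def run(self, code: str) -> str:
        try:
            exec(code, self.namespace)  # mutates the shared namespace
            return str(self.namespace.get("_result", "ok"))
        except Exception as e:  # surface errors back to the model as tool output
            return f"Error: {e}"


def outcome_reward(predicted: str, reference: str) -> float:
    """Outcome-only reward: 1.0 iff the final answer matches the reference."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each rollout's reward by the mean and
    std of its group (one group = all rollouts sampled for the same prompt)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]


# State persists across calls: the second call reads `x` from the first.
sandbox = StatefulSandbox()
sandbox.run("x = 21")
print(sandbox.run("_result = x * 2"))                   # -> 42
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # -> [1.0, -1.0, -1.0, 1.0]
```

Note that a group whose rollouts all receive the same reward yields zero advantage everywhere, which is exactly why the second RL stage filters out prompts the stage-1 model already solves every time.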
ASTER achieves state-of-the-art performance across competitive mathematical reasoning benchmarks:
Evaluation Parameters:

```
top_p = 1.0
temperature = 1.0
max_assistant_turns = 256
```
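If you re-run the evaluation with vLLM (which the RL environment below installs), the sampling settings map onto `SamplingParams` as shown here; `max_assistant_turns` bounds the agent loop rather than the decoder, so that part is a hypothetical sketch.

```python
from vllm import SamplingParams

# temperature / top_p come from the evaluation parameters above.
sampling = SamplingParams(temperature=1.0, top_p=1.0)

# max_assistant_turns caps the number of model turns in the agent loop;
# it is not a vLLM setting. A hypothetical outline of that loop:
MAX_ASSISTANT_TURNS = 256
for _turn in range(MAX_ASSISTANT_TURNS):
    ...  # generate one assistant turn, execute any tool call, append the result
```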
| Benchmark | Accuracy |
|---|---|
| AIME 2025 | 87.7% |
| HMMT 2025 | 77.1% |
Note: The AIME 2025 generation results are available in `asserts/aime25.jsonl`.
Our RL experiments use Qwen3-4B-Thinking-2507 as the backbone model and DAPO-17K as the training dataset.
First, download the dataset and place it in the correct location:
```bash
# Install huggingface_hub before downloading the dataset
# pip install huggingface_hub
# Download dataset from HuggingFace
# Method 1: Using huggingface-cli (provided by huggingface_hub package)
huggingface-cli download QuantumStackOverflow/Aster_SFT4K train.parquet --repo-type dataset --local-dir llamafactory/data --local-dir-use-symlinks False
# Or you can try:
# python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='QuantumStackOverflow/Aster_SFT4K', repo_type='dataset', local_dir='llamafactory/data', local_dir_use_symlinks=False)"
# Method 2: Manual download and move file
# After downloading train.parquet, move it to llamafactory/data/aster_sft.parquet
mv llamafactory/data/train.parquet llamafactory/data/aster_sft.parquet
```

Edit the `llamafactory/examples/train_full/qwen3_full_sft.yaml` file and set the following required parameters:
- `model_name_or_path`: Base model path (e.g., `Qwen/Qwen3-4B-Thinking-2507` or a local path)
- `dataset_dir`: Dataset directory path (e.g., `llamafactory/data`)
- `output_dir`: Model output directory path
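For orientation, the edited file might contain something like the following; this is a sketch of just the three required keys with placeholder values, not the full configuration.

```yaml
# Excerpt of llamafactory/examples/train_full/qwen3_full_sft.yaml (values are placeholders)
model_name_or_path: Qwen/Qwen3-4B-Thinking-2507
dataset_dir: llamafactory/data
output_dir: ckpts
```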
```bash
cd llamafactory
python -m venv llamafactory_env
source llamafactory_env/bin/activate
pip install -r requirements.txt
pip install liger-kernel tensorboard
pip install -e .
```
```bash
# Method 1: Override configuration via command line arguments (recommended)
bash scripts/sft_qwen3_4b.sh \
model_name_or_path=Qwen/Qwen3-4B-Thinking-2507 \
dataset_dir=data \
output_dir=ckpts
# Method 2: Use default configuration (requires setting all paths in yaml file)
bash scripts/sft_qwen3_4b.sh
```

Note:
- Training logs are automatically saved to `llamafactory/ckpts/train_sft_YYYYMMDD_HHMMSS.log`
- Adjust batch size via `per_device_train_batch_size` and `gradient_accumulation_steps` to fit memory constraints
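- For reference, the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps × number of GPUs`; e.g., a per-device batch of 1 with 16 accumulation steps on 8 GPUs gives an effective batch of 128 (the defaults in `scripts/sft_qwen3_4b.sh` may differ)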
⚠️ Important: Our sandbox is stateful — it persists execution state across multiple tool invocations, which is crucial for multi-turn tool-integrated reasoning. Please follow the installation guide carefully.
See detailed installation instructions in sandboxfusion/README.md.
- Install training env

```bash
conda create -n aster python=3.11.13
conda activate aster
pip install 'vllm==0.10.2'
pip install -r requirements.txt
pip install flash_attn==2.8.3 --no-build-isolation
pip install -e .
```

- Get training and validation data
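The download commands for this step are not spelled out here; one plausible way to fetch DAPO-17K (an assumption: it is commonly hosted as `BytedTsinghua-SIA/DAPO-Math-17k` on HuggingFace — check `scripts/train/` for the exact source, and note the later commands expect the file at `./datasets/dapo.parquet`):

```bash
# Assumption: DAPO-17K comes from the HuggingFace dataset below; verify against
# the training scripts, which expect the parquet at ./datasets/dapo.parquet.
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='BytedTsinghua-SIA/DAPO-Math-17k', repo_type='dataset', local_dir='./datasets/dapo_raw')"
# Rename/move the downloaded train parquet to the expected path (filename may differ):
# mv ./datasets/dapo_raw/<train split>.parquet ./datasets/dapo.parquet
```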
```bash
# Train 4B model for the first stage with 3 epochs
bash scripts/train/train_4b_first_stage.sh
```

- Filter out the prompts that were answered correctly in all rollouts during the last epoch of the first stage (these yield no GRPO learning signal)

```bash
python scripts/train/filter_all_correct.py \
--dataset ./datasets/dapo.parquet \
--rollout-dir ./final_ckpts/Aster_4B_maxturns_50_maxresp_16384/rollout_data \
--output ./datasets/dapo_filter_all_correct.parquet \
--step-min 271 \
--step-max 405 \
--plot \
--threads 32
```
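Conceptually, the filtering step does something like the following simplified sketch (hypothetical column names; the real `filter_all_correct.py` also handles the step range and plotting):

```python
import pandas as pd

# Hypothetical schema: one row per rollout, with `prompt_id` and `reward` columns.
rollouts = pd.read_parquet("rollout_data.parquet")
dataset = pd.read_parquet("./datasets/dapo.parquet")

# A prompt is "all correct" when every rollout for it earned the full reward.
all_correct = rollouts.groupby("prompt_id")["reward"].min() >= 1.0
solved_ids = set(all_correct[all_correct].index)

# Keep only prompts the model still fails sometimes; all-correct groups yield
# zero GRPO advantage and thus no learning signal in the second stage.
filtered = dataset[~dataset["prompt_id"].isin(solved_ids)]
filtered.to_parquet("./datasets/dapo_filter_all_correct.parquet")
```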
- Convert checkpoints

```bash
python -m verl.model_merger merge \
--backend fsdp \
--local_dir ./final_ckpts/Aster_4B_maxturns_50_maxresp_16384/global_step_405/actor \
--target_dir ./final_ckpts/Aster_4B_maxturns_50_maxresp_16384/global_step_405/actor/huggingface
mv ./final_ckpts/Aster_4B_maxturns_50_maxresp_16384/global_step_405/actor/huggingface /root/ckpts/Aster_4B_maxturns_50_maxresp_16384_step405
```

- Train the 4B model with the filtered data. Note: remember to update the model path in the script to the converted first-stage checkpoint path.

```bash
bash scripts/train/train_4b_second_stage.sh
```

- Thanks to the authors of the ASTER paper for their detailed work
- Built on top of Qwen3-4B base models
- Uses DAPO as the training dataset
- Uses LlamaFactory for multi-turn SFT
- Uses SandboxFusion for the stateful code sandbox during RL training
- Uses the verl framework for RL training
```bibtex
@misc{zhang2026aster,
  title         = {ASTER: Agentic Scaling with Tool-integrated Extended Reasoning},
  author        = {Zhang, Xuqin and He, Quan and Zheng, Zhenrui and Zhang, Zongzhang and He, Xu and Li, Dong},
  year          = {2026},
  eprint        = {2602.01204},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```
