Fixing GRPO training collapse in long-horizon multi-tool agents. A lightweight PRM-Lite + LATA joint approach achieves +37% over vanilla GRPO on τ-bench airline (50-task, multi-turn).
reinforcement-learning long-horizon qwen agentic-ai tool-calling process-reward-model grpo tau-bench multi-turn-agents
Updated May 11, 2026 - Python