🤖 AI Summary
This work addresses the lack of convergence guarantees in the reinforcement learning algorithms used to train large language model agents in multi-turn interactive settings, a gap that often leads to unstable training. The authors model multi-turn interaction as a sequence of multi-agent bandit problems and propose the first critic-free, sequence-level policy optimization method with provable convergence. By combining backward induction, sequence-level sequential policy updates, and Group Relative Advantage Estimation (GRAE), the method ensures monotonic policy improvement and convergence to a global optimum. Evaluated on the AppWorld and BFCL v4 benchmarks, the approach achieves relative performance gains of 43.3%–54.6% and 24.1%–41.9% over strong baselines when applied to Qwen3-14B and Qwen2.5-14B, respectively, significantly outperforming state-of-the-art algorithms while demonstrating exceptional training stability.
📝 Abstract
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single- and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) converges to the globally optimal policy under undiscounted conditions, but combining PPO with GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously be critic-free and provide convergence guarantees in multi-turn scenarios. To address this, we propose SeeUPO (Sequence-level Sequential Update Policy Optimization), a critic-free approach with convergence guarantees for multi-turn interactions. SeeUPO models multi-turn interaction as sequentially executed multi-agent bandit problems. Through turn-by-turn sequential policy updates in reverse execution order, it ensures monotonic improvement and convergence to the globally optimal solution via backward induction. Experiments on AppWorld and BFCL v4 demonstrate SeeUPO's substantial improvements over existing backbone algorithms: relative gains of 43.3%–54.6% on Qwen3-14B and 24.1%–41.9% on Qwen2.5-14B (averaged across benchmarks), along with superior training stability.
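The abstract does not spell out the GRAE formula, but group-relative advantage estimation as popularized by GRPO-style methods normalizes each sampled sequence's reward against the mean and standard deviation of its sampling group, removing the need for a learned critic. A minimal sketch under that assumption (the function name `grae_advantages` and the epsilon stabilizer are illustrative, not from the paper):

```python
import statistics

def grae_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages for one prompt's sampled responses.

    Each sequence-level reward is standardized against the group's
    mean and (population) standard deviation, so no value network
    (critic) is needed to form a baseline.
    """
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# Four rollouts for the same prompt with sequence-level rewards:
advs = grae_advantages([1.0, 0.0, 0.5, 0.5])
# Advantages are zero-mean within the group; above-average rollouts
# get positive advantage, below-average ones negative.
```

Because the baseline is computed per group rather than by a critic, this matches the "critic-free" property the paper emphasizes; SeeUPO's novelty lies in how such sequence-level updates are ordered turn by turn, which this sketch does not attempt to reproduce.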