🤖 AI Summary
This work addresses three key challenges in multi-turn agentic reinforcement learning: limited exploration diversity, sparse credit assignment, and misalignment between policy optimization and the natural decision granularity of agent interactions. To tackle these issues, the authors propose AT$^2$PO, a unified turn-level tree-structured framework. The framework enhances exploration diversity through an entropy-guided tree expansion mechanism, mitigates sparse rewards through turn-wise credit assignment that propagates fine-grained credit from sparse outcomes, and adds a turn-level policy optimization objective that is orthogonal to tree search and aligns learning with the intrinsic decision granularity of agentic interactions. Evaluated on seven benchmarks, the approach improves over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies confirming the effectiveness of each component.
📝 Abstract
LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization (ATPO), a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline, by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
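To make the two tree-side ideas concrete, here is a minimal sketch of how a turn-level tree could support entropy-guided expansion (branch from the most uncertain turn) and turn-wise credit assignment (back-propagate a sparse terminal reward through turns). All names, the `TurnNode` structure, and the discounted max back-up rule are illustrative assumptions for intuition only, not the paper's actual algorithm.

```python
# Illustrative sketch only: the node layout, entropy criterion, and the
# discounted back-up rule below are assumptions, not AT^2PO's method.
from dataclasses import dataclass, field

@dataclass
class TurnNode:
    entropy: float                       # policy entropy of this turn's tokens
    children: list = field(default_factory=list)
    reward: float = 0.0                  # sparse outcome reward (nonzero at leaves)
    credit: float = 0.0                  # turn-wise credit after back-up

def frontier(node):
    """Collect the expandable leaf turns of the tree."""
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in frontier(c)]

def expand_highest_entropy(root, new_turns):
    """Branch from the frontier turn with the highest policy entropy,
    steering exploration toward the most uncertain decision point."""
    node = max(frontier(root), key=lambda n: n.entropy)
    node.children.extend(new_turns)
    return node

def assign_credit(node, gamma=0.9):
    """Propagate sparse terminal rewards back turn by turn: each turn's
    credit is its own reward plus the discounted best child credit."""
    if not node.children:
        node.credit = node.reward
    else:
        node.credit = node.reward + gamma * max(
            assign_credit(c, gamma) for c in node.children
        )
    return node.credit

# Tiny usage example: two candidate turns; the high-entropy one is expanded.
root = TurnNode(entropy=0.5)
uncertain, confident = TurnNode(entropy=1.2), TurnNode(entropy=0.3)
root.children = [uncertain, confident]
picked = expand_highest_entropy(root, [TurnNode(entropy=0.8, reward=1.0)])
assign_credit(root)   # turn-wise credits now reflect the sparse outcome
```

The point of the sketch is the granularity: both exploration and credit operate on whole turns (reasoning plus tool call), not individual tokens or entire episodes, which is the alignment the paper's turn-level objective targets.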