Empowering Multi-Turn Tool-Integrated Reasoning with Group Turn Policy Optimization

📅 2025-11-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing reinforcement learning methods (e.g., GRPO) struggle to optimize large language models (LLMs) for multi-turn tool-integrated reasoning (TIR) because they rely on sparse, coarse-grained trajectory-level rewards, which inadequately support sustained optimization over complex, multi-step interactions. To address this, we propose Group Turn Policy Optimization (GTPO), which combines fine-grained turn-level reward assignment, return-based advantage estimation that uses normalized discounted returns to mitigate reward sparsity, and self-supervised reward shaping that exploits code-execution feedback to densify sparse outcome rewards. Our core contribution lies in refining reward modeling from the trajectory level down to individual turn-level decision units while leveraging execution feedback to construct dense supervision signals. Evaluated on multiple mathematical reasoning benchmarks, GTPO achieves an average 3.0% improvement over GRPO, enhancing LLMs' reasoning stability and generalization in realistic multi-turn tool-calling scenarios.

📝 Abstract
Training Large Language Models (LLMs) for multi-turn Tool-Integrated Reasoning (TIR) - where models iteratively reason, generate code, and verify through execution - remains challenging for existing reinforcement learning (RL) approaches. Current RL methods, exemplified by Group Relative Policy Optimization (GRPO), suffer from coarse-grained, trajectory-level rewards that provide insufficient learning signals for complex multi-turn interactions, leading to training stagnation. To address this issue, we propose Group Turn Policy Optimization (GTPO), a novel RL algorithm specifically designed for training LLMs on multi-turn TIR tasks. GTPO introduces three key innovations: (1) turn-level reward assignment that provides fine-grained feedback for individual turns, (2) return-based advantage estimation where normalized discounted returns are calculated as advantages, and (3) self-supervised reward shaping that exploits self-supervision signals from generated code to densify sparse binary outcome-based rewards. Our comprehensive evaluation demonstrates that GTPO outperforms GRPO by 3.0% on average across diverse reasoning benchmarks, establishing its effectiveness for advancing complex mathematical reasoning in the real world.
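The self-supervised reward shaping described in the abstract can be illustrated with a minimal sketch. The function name, the shaping weight, and the exact combination rule are illustrative assumptions, not the paper's formulation; the idea it captures is that a dense per-turn signal (did this turn's generated code execute successfully?) is added to the sparse binary outcome reward that is only available at the final turn.

```python
from typing import Optional


def shaped_turn_reward(exec_ok: bool, final_correct: Optional[bool],
                       shaping_weight: float = 0.1) -> float:
    """Densify a sparse outcome reward with a code-execution signal.

    exec_ok: whether this turn's generated code ran without error
        (a self-supervision signal available at every turn).
    final_correct: binary outcome reward, known only at the last
        turn of a trajectory (None for intermediate turns).
    shaping_weight: illustrative weight on the dense signal.
    """
    # Dense per-turn bonus from execution feedback.
    reward = shaping_weight * (1.0 if exec_ok else 0.0)
    # Sparse outcome reward, added only at the final turn.
    if final_correct is not None:
        reward += 1.0 if final_correct else 0.0
    return reward
```

Intermediate turns whose code executes cleanly thus receive a small positive reward even before the final answer is verified, which is one way to mitigate the training stagnation attributed to purely trajectory-level rewards.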
Problem

Research questions and friction points this paper is trying to address.

Improves multi-turn tool-integrated reasoning in large language models
Addresses coarse-grained rewards causing training stagnation in RL methods
Enhances mathematical reasoning through fine-grained turn-level feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Turn-level reward assignment for fine-grained feedback
Return-based advantage estimation using normalized discounted returns
Self-supervised reward shaping with code execution signals
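The second innovation above, return-based advantage estimation, can be sketched as follows. This is a hedged illustration under assumptions (the discount factor, the z-score normalization over all turns in a group, and the function name are not taken from the paper): per-turn rewards are converted to discounted returns-to-go, which are then normalized across the sampled group, GRPO-style, to serve as turn-level advantages.

```python
import math


def turn_advantages(group_turn_rewards, gamma=0.99):
    """Group-normalized discounted returns as turn-level advantages.

    group_turn_rewards: a group of sampled trajectories, each given
        as a list of per-turn rewards.
    Returns a parallel list-of-lists of advantage values.
    """
    # Discounted return-to-go for every turn in every trajectory.
    all_returns = []
    for rewards in group_turn_rewards:
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        all_returns.append(returns)

    # Normalize over all turns in the group (z-score, as in GRPO's
    # group-relative baseline, but applied at turn granularity).
    flat = [g for returns in all_returns for g in returns]
    mean = sum(flat) / len(flat)
    var = sum((g - mean) ** 2 for g in flat) / len(flat)
    std = math.sqrt(var) + 1e-8
    return [[(g - mean) / std for g in returns] for returns in all_returns]
```

With this scheme every turn carries its own advantage, so earlier turns that lead to a correct final answer are credited through the discounted return rather than receiving one undifferentiated trajectory-level score.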