🤖 AI Summary
Large language model (LLM) agents suffer from coarse-grained credit assignment in multi-step tool use: trajectory-level reinforcement learning struggles to attribute rewards accurately to individual reasoning steps. Method: We formalize the agent-environment interaction as a Markov decision process (MDP) and propose a fine-grained, turn-level advantage estimation method that enables precise per-turn credit assignment. The approach integrates seamlessly into existing RL algorithms such as GRPO without modifying the underlying training infrastructure. Results: On complex decision-making benchmarks, our method achieves a 100% tool-execution success rate and 50% exact-match accuracy, outperforming baselines, which fail to invoke tools and reach only 20-30% exact match. It substantially improves the robustness and consistency of LLM agents' multi-step reasoning, enabling more reliable and interpretable tool-augmented inference.
📝 Abstract
This paper investigates approaches to enhancing the reasoning capabilities of Large Language Model (LLM) agents using Reinforcement Learning (RL). Specifically, we focus on multi-turn tool-use scenarios, which can be naturally modeled as Markov Decision Processes (MDPs). Existing approaches often train multi-turn LLM agents with trajectory-level advantage estimation in bandit settings; these struggle with turn-level credit assignment across multiple decision steps, limiting their performance on multi-turn reasoning tasks. To address this, we introduce a fine-grained, turn-level advantage estimation strategy that enables more precise credit assignment in multi-turn agent interactions. The strategy is general and can be incorporated into various RL algorithms, such as Group Relative Policy Optimization (GRPO). Our experimental evaluation on multi-turn reasoning and search-based tool-use tasks with GRPO implementations highlights the effectiveness of the MDP framework and turn-level credit assignment in advancing the multi-turn reasoning capabilities of LLM agents in complex decision-making settings. Our method achieves a 100% tool-execution success rate and 50% exact-match accuracy, significantly outperforming baselines, which fail to invoke tools and achieve only 20-30% exact-match accuracy.
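To make the contrast concrete, the sketch below shows the difference between trajectory-level advantage estimation (one group-normalized advantage broadcast to every turn of a trajectory, as in standard GRPO) and a turn-level variant that normalizes a per-turn return-to-go across the sampled group. This is a minimal illustration under assumed design choices (a decomposable per-turn reward and per-turn-index group normalization); it is not the paper's actual implementation, and the function names are hypothetical.

```python
import statistics

def trajectory_level_advantages(returns):
    """GRPO-style baseline: one advantage per trajectory, normalized
    across the sampled group and shared by all of its turns."""
    mean = statistics.mean(returns)
    std = statistics.pstdev(returns) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in returns]

def turn_level_advantages(turn_rewards, gamma=1.0):
    """Hypothetical turn-level variant: compute a discounted
    return-to-go for each turn, then normalize each turn index
    across the group, so credit is assigned per decision step."""
    # Discounted return-to-go for every turn of every trajectory.
    rtg = []
    for rewards in turn_rewards:
        g, out = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            out.append(g)
        rtg.append(list(reversed(out)))
    # Normalize across the group separately at each turn index.
    n_turns = min(len(t) for t in rtg)
    adv = [[0.0] * n_turns for _ in rtg]
    for t in range(n_turns):
        col = [traj[t] for traj in rtg]
        mean = statistics.mean(col)
        std = statistics.pstdev(col) or 1.0
        for i, traj in enumerate(rtg):
            adv[i][t] = (traj[t] - mean) / std
    return adv
```

With a group of three two-turn rollouts such as `[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]`, the trajectory-level estimator gives each rollout a single constant advantage, whereas the turn-level estimator can assign positive credit to one turn and negative credit to another within the same rollout, which is the finer-grained credit assignment the abstract describes.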