ResT: Reshaping Token-Level Policy Gradients for Tool-Use Large Language Models

📅 2025-09-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high variance and unstable convergence of policy gradients in tool-augmented large language models (LLMs) trained via reinforcement learning, caused primarily by sparse rewards, this paper proposes an entropy-aware, token-level policy-gradient reshaping method. The authors first establish a theoretical connection between policy entropy and training stability in tool-use tasks, then design a progressive reweighting mechanism that dynamically amplifies gradient weights for reasoning-relevant tokens, enabling a smooth optimization shift from structural correctness to semantic reasoning. By integrating information-theoretic entropy modeling with fine-grained gradient control, the approach achieves state-of-the-art performance on BFCL and API-Bank, outperforming prior methods by up to 8.76%. Fine-tuned on a 4B-parameter model, it surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn tasks.

📝 Abstract
Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and lacks consideration of the particularity of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability of tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose **Res**haped **T**oken-level policy gradients (**ResT**) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
Problem

Research questions and friction points this paper is trying to address.

Optimizing tool-use policies in large language models
Reducing policy-gradient variance in reinforcement learning
Stabilizing training convergence for multi-turn tool-use tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reshapes policy gradients via token reweighting
Uses entropy to prioritize reasoning tokens progressively
Stabilizes convergence in multi-turn tool-use tasks
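The reweighting idea above can be sketched in a few lines. The sketch below is illustrative only: the paper does not expose its exact weighting formula here, so the linear interpolation between low-entropy (structural) and high-entropy (reasoning) token weights, the names `rest_token_weights` and `reshaped_pg_loss`, and the `alpha` scale are all assumptions chosen to mirror the described behavior — early training emphasizes structural tokens, late training shifts weight toward reasoning tokens.

```python
def rest_token_weights(token_entropies, progress, alpha=1.0):
    """Entropy-informed token weights (hypothetical form, not the paper's exact formula).

    token_entropies: per-token policy entropies for one sequence
    progress: training progress in [0, 1] (0 = start, 1 = end)
    alpha: assumed scale of the reshaping term
    """
    h_max = max(token_entropies) + 1e-8  # normalize entropies to [0, 1]
    weights = []
    for h_raw in token_entropies:
        h = h_raw / h_max
        # progress ~ 0: upweight low-entropy structural tokens (1 - h);
        # progress ~ 1: shift weight toward high-entropy reasoning tokens (h)
        w = (1 - progress) * (1 - h) + progress * h
        weights.append(1.0 + alpha * w)
    return weights

def reshaped_pg_loss(logps, advantages, token_entropies, progress):
    """Token-level policy-gradient loss with the reshaped weights applied."""
    ws = rest_token_weights(token_entropies, progress)
    n = len(logps)
    return -sum(w * lp * a for w, lp, a in zip(ws, logps, advantages)) / n
```

Under this assumed form, a low-entropy token (e.g. a structural bracket in a tool call) dominates the gradient early in training, and a high-entropy reasoning token dominates late, which matches the paper's described shift from structural correctness to semantic reasoning.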