🤖 AI Summary
To address the high variance and unstable convergence of policy gradients in tool-augmented large language models (LLMs) trained via reinforcement learning—primarily caused by sparse rewards—this paper proposes an entropy-aware, token-level policy gradient reshaping method. The authors first establish a theoretical connection between policy entropy and training stability in tool-use tasks, then design a progressive reweighting mechanism that dynamically amplifies gradient weights for reasoning-relevant tokens, enabling smooth optimization from structural correctness to semantic reasoning. By integrating information-theoretic entropy modeling with fine-grained gradient control, the approach achieves state-of-the-art performance on BFCL and API-Bank, outperforming prior methods by up to 8.76%. On a 4B-parameter model, it surpasses GPT-4o by 4.11% on single-turn tasks and by 1.50% on multi-turn tasks.
📝 Abstract
Large language models (LLMs) transcend passive generation and act as goal-directed agents by invoking external tools. Reinforcement learning (RL) offers a principled framework for optimizing these emergent tool-use policies, yet the prevailing paradigm relies exclusively on sparse outcome rewards and overlooks the particular structure of tool-use tasks, inflating policy-gradient variance and resulting in inefficient training. To better understand and address these challenges, we first establish a theoretical link between policy entropy and training stability in tool-use tasks, which reveals that structured, low-entropy tokens are primary determinants of rewards. Motivated by this insight, we propose **Res**haped **T**oken-level policy gradients (**ResT**) for tool-use tasks. ResT reshapes the policy gradient through entropy-informed token reweighting, progressively upweighting reasoning tokens as training proceeds. This entropy-aware scheme enables a smooth shift from structural correctness to semantic reasoning and stabilizes convergence in multi-turn tool-use tasks. Evaluation on BFCL and API-Bank shows that ResT achieves state-of-the-art results, outperforming prior methods by up to 8.76%. When fine-tuned on a 4B base LLM, ResT further surpasses GPT-4o by 4.11% on single-turn tasks and 1.50% on multi-turn base tasks.
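To make the core idea concrete, here is a minimal, hypothetical sketch of entropy-informed token reweighting in a policy-gradient loss. The specific weighting formula (linear interpolation between low- and high-entropy emphasis driven by a `progress` scalar) and the function name are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def rest_reshaped_loss(logits, actions, advantage, progress):
    """Illustrative entropy-aware token-level policy-gradient loss.

    logits:    (T, V) per-token policy logits over the vocabulary
    actions:   (T,)   sampled token ids
    advantage: scalar outcome-level advantage from the sparse reward
    progress:  training progress in [0, 1]
    """
    # Numerically stable log-softmax per token position.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    token_logp = log_probs[np.arange(len(actions)), actions]

    # Per-token policy entropy, normalized to [0, 1]. The paper's insight is
    # that structured tool-call tokens are low-entropy, while reasoning
    # tokens are higher-entropy (assumption of this sketch).
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=-1)
    h = (entropy - entropy.min()) / (entropy.max() - entropy.min() + 1e-8)

    # Progressive reweighting: early in training, emphasize low-entropy
    # (structural) tokens; as progress -> 1, shift weight toward
    # high-entropy (reasoning) tokens.
    weights = (1.0 - progress) * (1.0 - h) + progress * h

    # Reweighted REINFORCE-style objective (negated for minimization).
    return -(weights * token_logp * advantage).mean()
```

In this sketch the same trajectory contributes different gradients over training: early updates are dominated by tokens that determine structural correctness of the tool call, later updates by the semantic reasoning tokens.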