🤖 AI Summary
This work addresses the challenge of coarse-grained reward assignment in existing tool-integrated reasoning methods, which struggle to distinguish effective tool calls from redundant or erroneous ones in long-horizon, multi-turn tasks. To enable fine-grained credit allocation, the authors are the first to formulate the alignment between predicted and ground-truth trajectories as a bipartite graph matching problem. They further propose a dual-level advantage estimation mechanism that integrates both turn-level and trajectory-level signals to produce precise turn-level rewards. This approach significantly enhances the tool-using reasoning capabilities of large language models, outperforming current methods across three benchmarks. Notably, their 4B-parameter model surpasses most 8B-parameter counterparts, with particularly strong performance in long-horizon, multi-turn scenarios.
📝 Abstract
Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning uniform advantages to all steps within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our code is available at https://github.com/quchangle1/MatchTIR.
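The core matching idea can be illustrated with a minimal sketch, which is not the paper's implementation: assuming some similarity score between each predicted and each ground-truth tool call, an optimal bipartite assignment (here via the Hungarian algorithm in `scipy.optimize.linear_sum_assignment`) yields a one-to-one alignment from which dense turn-level rewards can be read off. The toy `similarity` function and the zero-reward treatment of unmatched turns below are illustrative assumptions.

```python
# Sketch of bipartite-matching-based turn-level reward assignment.
# Assumptions (not from the paper): turns are compared by a toy
# similarity function; a matched turn's reward is its similarity;
# unmatched predicted turns (e.g. redundant calls) receive zero.
import numpy as np
from scipy.optimize import linear_sum_assignment

def turn_level_rewards(pred_turns, gold_turns, similarity):
    # Similarity matrix between predicted and ground-truth turns.
    sim = np.array([[similarity(p, g) for g in gold_turns]
                    for p in pred_turns])
    # Hungarian algorithm: maximize total similarity by minimizing -sim.
    rows, cols = linear_sum_assignment(-sim)
    rewards = np.zeros(len(pred_turns))
    rewards[rows] = sim[rows, cols]  # matched turns earn their similarity
    return rewards  # unmatched turns keep reward 0

# Toy similarity: 1.0 if the invoked tool matches, else 0.0.
sim_fn = lambda p, g: float(p["tool"] == g["tool"])

pred = [{"tool": "search"}, {"tool": "search"}, {"tool": "calc"}]
gold = [{"tool": "search"}, {"tool": "calc"}]
print(turn_level_rewards(pred, gold, sim_fn))
```

In this example only one of the two `search` calls can be matched to the single ground-truth `search` turn, so the redundant call gets zero reward, which is exactly the fine-grained signal a uniform trajectory-level reward cannot provide.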