DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

📅 2026-05-28
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing tool-augmented reasoning methods do not deliberate, introspect, or self-correct at intermediate steps, and they rely on sparse outcome-based rewards that make complex reasoning processes hard to supervise. This work proposes DeepTool, a novel framework that brings process-supervised reinforcement learning to tool-integrated reasoning. DeepTool first synthesizes interleaved thought-action-observation trajectories, injecting adversarial perturbations so that the trajectories are robust and self-correcting, and then trains with a GRPO-based, action-centric process reward applied at every turn. Evaluated across six benchmarks, the method substantially improves Qwen2.5-7B (e.g., accuracy on AIME24 rises from 3.2% to 40.4% and on HMMT25 from 0.0% to 28.6%) while achieving an optimal trade-off between performance and token efficiency.
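The interleaved thought-action-observation loop mentioned above can be pictured with a minimal Python sketch. This is not DeepTool's implementation: `Turn`, `run_episode`, `toy_policy`, and `calc` are hypothetical names invented for illustration, and the "policy" is a hard-coded stand-in for an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Turn:
    thought: str                # free-form reasoning produced before acting
    action: Optional[str]       # tool name, or None when the policy answers directly
    observation: Optional[str]  # tool output fed back into the next turn


# Hypothetical policy signature: (question, history) -> (thought, action, argument)
Policy = Callable[[str, List[Turn]], Tuple[str, Optional[str], str]]


def run_episode(policy: Policy, tools: Dict[str, Callable[[str], str]],
                question: str, max_turns: int = 4) -> Tuple[List[Turn], Optional[str]]:
    """Alternate thinking, tool calls, and observations until an answer is given."""
    history: List[Turn] = []
    for _ in range(max_turns):
        thought, action, arg = policy(question, history)
        if action is None:
            # No tool requested: treat the argument as the final answer.
            history.append(Turn(thought, None, None))
            return history, arg
        tool = tools.get(action)
        observation = tool(arg) if tool else f"error: unknown tool {action!r}"
        history.append(Turn(thought, action, observation))
    return history, None  # turn budget exhausted without an answer


def toy_policy(question: str, history: List[Turn]) -> Tuple[str, Optional[str], str]:
    """Hard-coded stand-in for an LLM: call the calculator once, then answer."""
    if not history:
        return "I should compute the product with the calculator.", "calc", "6*7"
    return "The tool returned a value, so I can answer.", None, history[-1].observation


def calc(expr: str) -> str:
    """Toy tool (assumption): multiplies two integers written as 'a*b'."""
    a, b = expr.split("*")
    return str(int(a) * int(b))


history, answer = run_episode(toy_policy, {"calc": calc}, "What is 6*7?")
```

In this toy run the first turn records a thought, a `calc` call, and its observation, and the second turn ends the episode with the answer. A process-level reward would score each such turn rather than only the final answer.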
📝 Abstract
Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional RL approaches for TIR are hindered by sparse outcome-based rewards, which fail to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Second, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.
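To make the contrast between sparse outcome rewards and per-turn process rewards concrete, here is a hedged Python sketch. The group-relative normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. Everything else is an assumption for illustration: the abstract does not specify how the Action-Centric Process Reward is computed or combined with the outcome reward, so `trajectory_reward`, its `weight` parameter, and the per-turn scores are hypothetical.

```python
import statistics
from typing import List


def trajectory_reward(outcome: float, turn_rewards: List[float],
                      weight: float = 0.5) -> float:
    """Combine a sparse outcome reward with the mean per-turn process reward.

    Assumption: a simple weighted sum; the paper's actual formula is not given here.
    """
    process = sum(turn_rewards) / len(turn_rewards) if turn_rewards else 0.0
    return outcome + weight * process


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: standardize each reward within its sampled group.

    Uses the population standard deviation (an assumption; implementations differ),
    with eps guarding against a zero-variance group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: (final-answer reward, per-turn tool-call scores).
group = [
    (1.0, [1.0, 1.0]),  # correct answer, clean tool calls
    (1.0, [0.0, 1.0]),  # correct answer, one faulty tool call
    (0.0, [1.0, 0.0]),  # wrong answer, one good tool call
    (0.0, [0.0, 0.0]),  # wrong answer, faulty tool calls
]
rewards = [trajectory_reward(outcome, turns) for outcome, turns in group]
advantages = group_relative_advantages(rewards)
```

With outcome rewards alone, the first two rollouts would be indistinguishable, as would the last two. Adding the per-turn term separates them, so the rollout with precise tool calls at every turn receives the largest advantage.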
Problem

Research questions and friction points this paper is trying to address.

Tool-Integrated Reasoning
deliberation
reinforcement learning
process supervision
interleaved reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process-Supervised Reinforcement Learning
Tool-Integrated Reasoning
Interleaved Deliberation
Action-Centric Process Reward
GRPO