Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

📅 2026-05-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses how to give thinking models natural tool-use behavior without sacrificing their native text-only reasoning. The authors present a full-pipeline Tool-Integrated Reasoning (TIR) training recipe. It starts with supervised fine-tuning (SFT) on teacher trajectories for problems inherently suited to tool-augmented solutions, controls the proportion of tool-use trajectories to mitigate catastrophic forgetting, and optimizes SFT for pass@k and response length rather than training loss so that headroom remains for reinforcement learning. A stable Reinforcement Learning with Verifiable Rewards (RLVR) stage, with explicit safeguards against mode collapse, follows. The study highlights the learnability of teacher trajectories as critical to effective TIR SFT. Applied to Qwen3 thinking models, the recipe reports state-of-the-art results among open-source models, including 96.7% (4B) and 99.2% (30B) on AIME 2025.
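As a rough illustration of what controlling the proportion of tool-use trajectories could look like in practice, the sketch below builds an SFT mixture with a fixed share of tool-use examples. The function name `mix_sft_data`, the `tool_fraction` parameter, and the sampling scheme are all assumptions made for this example; the page does not describe how the authors actually construct their mixture or which ratio they use.

```python
import random


def mix_sft_data(tool_trajs, text_trajs, tool_fraction, total, seed=0):
    """Sample `total` trajectories with a fixed share of tool-use examples.

    Hypothetical helper: the ratio and the sampling scheme are illustrative
    assumptions, not the recipe from the paper.
    """
    if not 0.0 <= tool_fraction <= 1.0:
        raise ValueError("tool_fraction must be in [0, 1]")
    rng = random.Random(seed)
    n_tool = round(total * tool_fraction)
    n_text = total - n_tool
    if n_tool > len(tool_trajs) or n_text > len(text_trajs):
        raise ValueError("not enough trajectories for the requested mixture")
    # Sample without replacement from each pool, then shuffle the mixture.
    mixed = rng.sample(tool_trajs, n_tool) + rng.sample(text_trajs, n_text)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    tool_pool = [("tool", i) for i in range(100)]
    text_pool = [("text", i) for i in range(100)]
    batch = mix_sft_data(tool_pool, text_pool, tool_fraction=0.3, total=50)
    print(sum(1 for kind, _ in batch if kind == "tool"), "tool-use examples")
```

Sweeping `tool_fraction` and checking no-tool benchmark accuracy at each setting is one way such a ratio might be tuned.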
📝 Abstract
Tool-integrated reasoning (TIR) offers a direct way to extend thinking models beyond the limits of text-only reasoning. Paradoxically, we observe that tool-enabled evaluation can degrade reasoning performance even when the strong thinking models make almost no actual tool calls. In this paper, we investigate how to inject natural tool-use behavior into a strong thinking model without sacrificing its no-tool reasoning ability, and present a comprehensive TIR recipe. We highlight that (i) the effectiveness of TIR supervised fine-tuning (SFT) hinges on the learnability of teacher trajectories, which should prioritize problems inherently suited for tool-augmented solutions; (ii) controlling the proportion of tool-use trajectories could mitigate the catastrophic forgetting of text-only reasoning capacity; (iii) optimizing for pass@k and response length instead of training loss could maximize TIR SFT gains while preserving headroom for reinforcement learning (RL) exploration; (iv) a stable RL with verifiable rewards (RLVR) stage, built upon suitable SFT initialization and explicit safeguards against mode collapse, provides a simple yet remarkably effective solution. When applied to Qwen3 thinking models at 4B and 30B scales, our recipe yields models that achieve state-of-the-art performance in a wide range of benchmarks among open-source models, such as 96.7% and 99.2% on AIME 2025 for 4B and 30B, respectively.
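For readers unfamiliar with pass@k, the metric is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n is the number of sampled responses for a problem and c is the number that are correct. The sketch below implements that standard estimator and averages it over problems. The function names are chosen for this example, and the abstract does not state which estimator or sample counts the authors use.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; `results` is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    per_problem = [(8, 8), (8, 3), (8, 0)]
    print(mean_pass_at_k(per_problem, k=1))
    print(mean_pass_at_k(per_problem, k=4))
```

A score like this, tracked alongside mean response length, could be compared across SFT checkpoints in place of training loss.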
Problem

Research questions and friction points this paper is trying to address.

tool-integrated reasoning
thinking models
catastrophic forgetting
reasoning performance
tool-use behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tool-Integrated Reasoning
Supervised Fine-Tuning
Catastrophic Forgetting Mitigation
Reinforcement Learning with Verifiable Rewards
Thinking Models