🤖 AI Summary
Tool-integrated reasoning (TIR) lacks rigorous theoretical foundations, hindering understanding of how external tools extend the reasoning capabilities of large language models (LLMs). Method: We formally prove that external tools strictly expand the set of solvable problems beyond pure textual reasoning. We propose Advantage Shaping Policy Optimization (ASPO), a reinforcement learning algorithm that guides LLMs to invoke external tools, such as Python interpreters, efficiently and stably. Results: On mathematical reasoning benchmarks, ASPO significantly improves pass@k performance. Empirical analysis reveals an enhanced cognitive pattern: earlier tool invocation, more interaction rounds, and balanced handling of both computational and abstract problems. Our core contributions are: (1) the first formal proof of TIR's effectiveness; (2) the ASPO algorithm, enabling robust, policy-guided tool use; and (3) identification of a novel reasoning mechanism that emerges from tool augmentation, characterized by dynamic, interactive, hybrid symbolic-numeric processing.
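The summary reports gains on the pass@k metric. For readers unfamiliar with it, here is a minimal sketch of the standard unbiased combinatorial estimator of pass@k (the form popularized in code-generation evaluation; the paper may compute the metric differently):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k.

    n: total samples drawn per problem
    c: number of correct samples among them
    k: sampling budget being evaluated

    Equals 1 - C(n-c, k) / C(n, k): the probability that at least
    one of k samples (drawn without replacement from the n) is correct.
    """
    if n - c < k:
        # Fewer incorrect samples than k, so every size-k draw
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 samples, 1 correct; a single draw succeeds half the time.
print(pass_at_k(2, 1, 1))  # 0.5
```

Per-problem estimates are then averaged across the benchmark to obtain the reported pass@k score.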
📝 Abstract
We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools such as Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability or performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to steer policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we show that ASPO improves tool-usage behavior, yielding earlier code invocation and substantially more interaction turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
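The abstract describes ASPO as directly modifying the advantage function to steer tool-use behavior. The paper's exact shaping function is not reproduced here, so the following is a hypothetical sketch only: it assumes a GRPO-style group-normalized advantage as the baseline estimator and adds an illustrative bonus for earlier tool invocation. The function name, the `beta` coefficient, and the `first_call_steps` input are all assumptions for illustration, not the paper's API.

```python
import numpy as np

def shaped_advantages(rewards, first_call_steps, max_steps=100, beta=0.1):
    """Toy advantage shaping for tool-integrated rollouts.

    rewards: scalar reward per rollout in a sampled group
    first_call_steps: token/step index of each rollout's first tool call
    beta: strength of the (hypothetical) early-invocation bonus
    """
    r = np.asarray(rewards, dtype=float)
    # Group-normalized advantage (GRPO-style baseline; an assumption,
    # the paper's estimator may differ).
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # Illustrative shaping term: earlier first tool call -> larger bonus.
    s = np.clip(np.asarray(first_call_steps, dtype=float), 0, max_steps)
    bonus = beta * (1.0 - s / max_steps)
    # The shaped advantage replaces the raw one in the policy-gradient loss,
    # so behavior is steered without changing the reward itself.
    return adv + bonus

# A successful rollout that called the tool early outranks a failed late one.
a = shaped_advantages(rewards=[1.0, 0.0], first_call_steps=[10, 90])
```

The design point this toy version illustrates is the one the abstract makes: shaping acts on the advantage, not the reward, so the preferred behavior (early, frequent tool calls) is encouraged while the underlying correctness signal and training stability are left intact.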