🤖 AI Summary
Large language models (LLMs) exhibit limited accuracy in precise computation and multi-step algebraic reasoning; while tool-integrated reasoning (TIR) improves correctness, it introduces runtime dependencies on external tools that hinder scalability and deployment flexibility.
Method: We propose *Tool Knowledge Distillation*—a framework that internalizes the mathematical solving capabilities of external tools into the model via back-translation: solution traces generated by a tool-using agent are transformed by a translation-and-rewriting agent into structured, logically coherent natural language reasoning chains. This process distills tool-augmented reasoning into pure textual form, enabling subsequent fine-tuning of small open-source models using only synthetic data—without any runtime tool invocation.
Contribution/Results: Our method achieves significant performance gains on competitive mathematics benchmarks (e.g., MATH, AMC), demonstrating that natural language distillation can effectively replace runtime tool reliance. It establishes a new paradigm for lightweight, autonomous, and deployable mathematical reasoning models.
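The back-translation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dataclasses and the `translate_call` / `rephrase` functions are hypothetical stand-ins (real Translator and Rephrase Agents would be LLM-prompted), shown only to make the trace-to-narrative flow concrete.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    code: str      # symbolic tool invocation, e.g. a Python/CAS snippet
    result: str    # interpreter output for that snippet

@dataclass
class TIRTrace:
    problem: str
    steps: list = field(default_factory=list)  # interleaved prose strings and ToolCall objects

def translate_call(call: ToolCall) -> str:
    """Stand-in for the Translator Agent: explain one tool call in prose.
    A real system would prompt an LLM; a template stands in here."""
    return f"Carrying out the computation {call.code.strip()} yields {call.result}."

def rephrase(problem: str, parts: list) -> str:
    """Stand-in for the Rephrase Agent: merge per-step explanations into
    a single coherent narrative (a real system would prompt an LLM)."""
    return f"Problem: {problem}\nSolution: " + " ".join(parts)

def back_translate(trace: TIRTrace) -> str:
    """Convert an interleaved TIR trace into a pure natural-language trace."""
    parts = [translate_call(s) if isinstance(s, ToolCall) else s
             for s in trace.steps]
    return rephrase(trace.problem, parts)
```

The resulting string contains no tool-call markup, so it can be used directly as fine-tuning text.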
📝 Abstract
Large language models (LLMs) often struggle with mathematical problems that require exact computation or multi-step algebraic reasoning. Tool-integrated reasoning (TIR) offers a promising solution by leveraging external tools such as code interpreters to ensure correctness, but it introduces inference-time dependencies that hinder scalability and deployment. In this work, we propose a new paradigm for distilling tool knowledge into LLMs purely through natural language. We first construct a Solver Agent that solves math problems by interleaving planning, symbolic tool calls, and reflective reasoning. Then, using a back-translation pipeline powered by multiple LLM-based agents, we convert interleaved TIR traces into natural language reasoning traces. A Translator Agent generates explanations for individual tool calls, while a Rephrase Agent merges them into a fluent and globally coherent narrative. Empirically, we show that fine-tuning a small open-source model on these synthesized traces enables it to internalize both tool knowledge and structured reasoning patterns, yielding gains on competition-level math benchmarks without requiring tool access at inference.
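The final fine-tuning stage consumes only the synthesized natural-language traces. A minimal sketch of how such traces might be packaged as supervised examples is below; the function names and the fence-detection heuristic are assumptions for illustration, not details from the paper.

```python
import re

def looks_tool_free(nl_trace: str) -> bool:
    """Assumed sanity filter: drop traces where code-fence or interpreter
    markup survived back-translation, since the target model must reason
    without tool syntax."""
    return re.search(r"```|>>>", nl_trace) is None

def to_sft_example(problem: str, nl_trace: str) -> dict:
    """Package one back-translated trace as a plain instruction-tuning pair.
    No tool-call markup remains, so the fine-tuned model needs no code
    interpreter at inference time."""
    return {
        "messages": [
            {"role": "user", "content": problem},
            {"role": "assistant", "content": nl_trace},
        ]
    }
```

Examples in this `messages` format can be fed to a standard supervised fine-tuning loop for a small open-source model.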