🤖 AI Summary
Large language models (LLMs) typically acquire tool-use capabilities through supervised fine-tuning on curated demonstrations, which limits how autonomously and flexibly they can deploy tools during reasoning.
Method: ToRL (Tool-Integrated Reinforcement Learning) is a purely reward-driven reinforcement learning framework that lets LLMs autonomously discover and refine strategies for using computational tools, without any supervised tool-use demonstrations. Starting from Qwen2.5-Math models, ToRL lets the model execute code during rollouts and trains it on outcome rewards alone, yielding emergent capabilities, including strategic tool invocation, self-regulation of ineffective code, and dynamic switching between computational and analytical reasoning.
Contribution/Results: ToRL trains strategic tool use through reinforcement learning alone, with no supervised fine-tuning stage for tool use. Empirically, ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%.
📝 Abstract
We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to autonomously use computational tools via reinforcement learning. Unlike supervised fine-tuning, ToRL allows models to explore and discover optimal strategies for tool use. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely through reward-driven learning.
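The core loop the abstract describes, executing the model's code during rollouts and training on an outcome reward alone, can be sketched minimally as follows. This is an illustrative approximation, not ToRL's actual implementation: the block names (`run_tool_calls`, `outcome_reward`), the `result` variable convention, and the unsandboxed `exec` are all assumptions; a real system would use a sandboxed interpreter and an answer-equivalence checker.

```python
import re

def run_tool_calls(response: str) -> str:
    """Execute each ```python code block found in a model rollout and
    collect its output, mimicking tool-integrated generation.
    Hypothetical sketch: ToRL's real sandbox and I/O protocol differ."""
    outputs = []
    for block in re.findall(r"```python\n(.*?)\n```", response, re.DOTALL):
        local_vars: dict = {}
        try:
            # WARNING: unsandboxed exec, for illustration only.
            # Convention (assumed): the block stores its answer in `result`.
            exec(block, {}, local_vars)
            outputs.append(str(local_vars.get("result", "")))
        except Exception as e:
            outputs.append(f"Error: {e}")
    return "\n".join(outputs)

def outcome_reward(final_answer: str, gold: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the
    reference, else 0.0. This correctness signal is the only supervision;
    no tool-use trajectories are imitated."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0
```

Because the reward depends only on the final answer, the policy is free to discover when invoking the tool helps and when plain reasoning suffices, which is the mechanism behind the emergent behaviors the abstract reports.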