🤖 AI Summary
Large language models (LLMs) typically acquire tool-use capabilities through supervised fine-tuning on curated demonstrations, which limits how autonomously and flexibly they can deploy tools during reasoning.
Method: ToRL (Tool-Integrated Reinforcement Learning) is a purely reward-driven reinforcement learning framework that lets LLMs autonomously discover and refine strategies for using computational tools, without any supervised tool-use demonstrations. Starting from Qwen2.5-Math models, ToRL lets the model execute code during rollouts and trains it on outcome rewards alone, yielding emergent capabilities, including strategic tool invocation, self-regulation of ineffective code, and dynamic switching between computational and analytical reasoning.
Contribution/Results: ToRL trains strategic tool use through reinforcement learning alone, with no supervised fine-tuning stage for tool use. Empirically, ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%.
📝 Abstract
We introduce ToRL (Tool-Integrated Reinforcement Learning), a framework for training large language models (LLMs) to autonomously use computational tools via reinforcement learning. Unlike supervised fine-tuning, ToRL allows models to explore and discover optimal strategies for tool use. Experiments with Qwen2.5-Math models show significant improvements: ToRL-7B reaches 43.3% accuracy on AIME 24, surpassing reinforcement learning without tool integration by 14% and the best existing Tool-Integrated Reasoning (TIR) model by 17%. Further analysis reveals emergent behaviors such as strategic tool invocation, self-regulation of ineffective code, and dynamic adaptation between computational and analytical reasoning, all arising purely through reward-driven learning.
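The core loop the abstract describes, executing the model's code during rollouts and training on an outcome reward alone, can be sketched minimally as follows. This is an illustrative approximation, not ToRL's actual implementation: the block names (`run_tool_calls`, `outcome_reward`), the `result` variable convention, and the unsandboxed `exec` are all assumptions; a real system would use a sandboxed interpreter and an answer-equivalence checker.

```python
import re

def run_tool_calls(response: str) -> str:
    """Execute each ```python code block found in a model rollout and
    collect its output, mimicking tool-integrated generation.
    Hypothetical sketch: ToRL's real sandbox and I/O protocol differ."""
    outputs = []
    for block in re.findall(r"```python\n(.*?)\n```", response, re.DOTALL):
        local_vars: dict = {}
        try:
            # WARNING: unsandboxed exec, for illustration only.
            # Convention (assumed): the block stores its answer in `result`.
            exec(block, {}, local_vars)
            outputs.append(str(local_vars.get("result", "")))
        except Exception as e:
            outputs.append(f"Error: {e}")
    return "\n".join(outputs)

def outcome_reward(final_answer: str, gold: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the
    reference, else 0.0. This correctness signal is the only supervision;
    no tool-use trajectories are imitated."""
    return 1.0 if final_answer.strip() == gold.strip() else 0.0
```

Because the reward depends only on the final answer, the policy is free to discover when invoking the tool helps and when plain reasoning suffices, which is the mechanism behind the emergent behaviors the abstract reports.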