Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch

📅 2025-11-02
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of eliciting intrinsic reasoning capabilities in language models via pure reinforcement learning (RL), thereby enhancing their generalization to unseen or complex tool-use scenarios. It proposes a dynamic generalization-guided reward mechanism that drives an autonomous evolution from exploration to exploitation. The method combines zero-initialized training (RL applied directly to base models without post-training), a tool-agnostic reward design, and an autonomous tool-calling architecture, enabling, for the first time, end-to-end acquisition of tool-use competence without supervised fine-tuning (SFT). Under a unified experimental setup, the approach achieves an average performance gain of over 7% compared to SFT and RL+SFT baselines, and it demonstrates strong robustness and generalization in both cross-dataset and in-dataset evaluations. This work establishes a novel paradigm for training tool-augmented language models exclusively through RL, advancing the frontier of unsupervised, reward-driven reasoning and tool interaction.
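The autonomous tool-calling architecture the summary refers to is, at its core, an agent loop in which the model interleaves free-form generation with tool invocations. Below is a minimal sketch of such a loop; the tag format, the `llm_generate` callable, and the `tools` registry are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of an autonomous tool-calling loop of the kind the summary
# describes. The tag format, llm_generate callable, and tools registry are
# illustrative assumptions, not the paper's actual interface.
import re

def run_tool_agent(llm_generate, tools: dict, prompt: str, max_turns: int = 5) -> str:
    """Let the model interleave reasoning with tool calls until it answers."""
    context = prompt
    output = ""
    for _ in range(max_turns):
        output = llm_generate(context)
        # Assume the model emits calls as <tool>name(arg)</tool>.
        match = re.search(r"<tool>(\w+)\((.*?)\)</tool>", output)
        if match is None:
            return output  # No tool call: treat the output as the final answer.
        name, arg = match.groups()
        result = tools[name](arg) if name in tools else f"unknown tool: {name}"
        # Feed the observation back so the model can keep reasoning end to end.
        context += output + f"\n<result>{result}</result>\n"
    return output  # Turn budget exhausted; return the last generation.
```

Because the loop is learned end to end under RL rather than imitated from SFT traces, the model itself must discover when to call a tool and when to answer directly.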

📝 Abstract
Training tool-augmented LLMs has emerged as a promising approach to enhancing language models' capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, the reinforcement learning (RL) paradigm has been shown to endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: can pure RL be used to effectively elicit a model's intrinsic reasoning capabilities and enhance tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series of models, trained to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings, with gains consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our method.
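As a rough illustration of the dynamic generalization-guided reward the abstract describes, the sketch below anneals a rule-based reward from exploration-oriented terms (well-formed, actual tool calls) toward exploitation-oriented terms (task success). The linear schedule and the specific term weights are assumptions for illustration; the paper's exact formulation may differ.

```python
# Hypothetical sketch of a dynamic generalization-guided reward: early in
# training, reward exploratory tool-use patterns; later, shift weight toward
# exploitative, outcome-based rewards. Schedule and weights are assumptions.

def tool_use_reward(step: int, total_steps: int,
                    format_valid: bool, called_tool: bool,
                    answer_correct: bool) -> float:
    """Rule-based reward that anneals from exploration to exploitation."""
    # Linear annealing coefficient in [0, 1]; the paper's schedule is not
    # specified here, so a simple linear ramp stands in for it.
    alpha = min(step / max(total_steps, 1), 1.0)

    # Exploration term: encourage well-formed, actual tool invocations.
    explore = 0.0
    if format_valid:
        explore += 0.5
    if called_tool:
        explore += 0.5

    # Exploitation term: reward only task success (outcome-based).
    exploit = 1.0 if answer_correct else 0.0

    # Progressively shift from exploratory to exploitative patterns.
    return (1.0 - alpha) * explore + alpha * exploit
```

In an RL trainer (e.g., a PPO- or GRPO-style setup), a scalar rule like this would stand in for a learned reward model, which is what makes the design rule-based and tool-agnostic: no tool-specific supervision or reward model is required.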
Problem

Research questions and friction points this paper is trying to address.

Enhancing tool-augmented LLMs' generalization in unfamiliar scenarios
Developing pure RL methods without supervised fine-tuning dependency
Enabling autonomous tool usage through scalable reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure RL training without supervised fine-tuning
Dynamic generalization-guided reward design
Scaling up base models with tool-agnostic generalization
👥 Authors
Yirong Zeng, Harbin Institute of Technology SCIR Lab
Xiao Ding, Harbin Institute of Technology SCIR Lab
Yutai Hou, Huawei (LLM, NLP, Dialogue, Alignment, Meta Learning)
Yuxian Wang, Huawei Technologies Co., Ltd
Li Du, Beijing Academy of Artificial Intelligence
Juyi Dai, Harbin Institute of Technology SCIR Lab
Qiuyang Ding, Huawei Technologies Co., Ltd
Duyu Tang, Huawei (Natural Language Processing)
Dandan Tu, Huawei Technologies Co., Ltd
Weiwen Liu, Associate Professor, Shanghai Jiao Tong University (large language models, AI agents, recommender systems)
Bing Qin, Professor, Harbin Institute of Technology (Natural Language Processing, Information Extraction, Sentiment Analysis)
Ting Liu, Harbin Institute of Technology SCIR Lab