Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch

📅 2025-11-02
🏛️ Conference on Empirical Methods in Natural Language Processing
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of eliciting intrinsic reasoning capabilities in language models via pure reinforcement learning (RL), thereby enhancing their generalization to unseen or complex tool-use scenarios. It proposes a dynamic generalization-guided reward mechanism that drives an autonomous evolution from exploration to exploitation. The method combines zero-initialized training (RL applied directly to base models without post-training), a tool-agnostic reward design, and an autonomous tool-calling architecture, enabling, for the first time, end-to-end acquisition of tool-use competence without supervised fine-tuning (SFT). Under a unified experimental setup, the approach achieves an average performance gain of over 7% compared to SFT and RL+SFT baselines, and it demonstrates strong robustness and generalization in both cross-dataset and in-dataset evaluations. This work establishes a novel paradigm for training tool-augmented language models exclusively through RL, advancing the frontier of unsupervised, reward-driven reasoning and tool interaction.
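The autonomous tool-calling architecture the summary refers to is, at its core, an agent loop in which the model interleaves free-form generation with tool invocations. Below is a minimal sketch of such a loop; the tag format, the `llm_generate` callable, and the `tools` registry are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch of an autonomous tool-calling loop of the kind the summary
# describes. The tag format, llm_generate callable, and tools registry are
# illustrative assumptions, not the paper's actual interface.
import re

def run_tool_agent(llm_generate, tools: dict, prompt: str, max_turns: int = 5) -> str:
    """Let the model interleave reasoning with tool calls until it answers."""
    context = prompt
    output = ""
    for _ in range(max_turns):
        output = llm_generate(context)
        # Assume the model emits calls as <tool>name(arg)</tool>.
        match = re.search(r"<tool>(\w+)\((.*?)\)</tool>", output)
        if match is None:
            return output  # No tool call: treat the output as the final answer.
        name, arg = match.groups()
        result = tools[name](arg) if name in tools else f"unknown tool: {name}"
        # Feed the observation back so the model can keep reasoning end to end.
        context += output + f"\n<result>{result}</result>\n"
    return output  # Turn budget exhausted; return the last generation.
```

Because the loop is learned end to end under RL rather than imitated from SFT traces, the model itself must discover when to call a tool and when to answer directly.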

📝 Abstract
Training tool-augmented LLMs has emerged as a promising approach to enhancing language models' capabilities for complex tasks. The current supervised fine-tuning paradigm relies on constructing extensive domain-specific datasets to train models. However, this approach often struggles to generalize effectively to unfamiliar or intricate tool-use scenarios. Recently, the reinforcement learning (RL) paradigm has been shown to endow LLMs with superior reasoning and generalization abilities. In this work, we address a key question: can pure RL be used to effectively elicit a model's intrinsic reasoning capabilities and enhance tool-agnostic generalization? We propose a dynamic generalization-guided reward design for rule-based RL, which progressively shifts rewards from exploratory to exploitative tool-use patterns. Based on this design, we introduce the Tool-Zero series of models, trained to autonomously utilize general tools by directly scaling up RL from Zero models (i.e., base models without post-training). Experimental results demonstrate that our models achieve over 7% performance improvement compared to both SFT and RL-with-SFT models under the same experimental settings, with gains consistently replicated across cross-dataset and intra-dataset evaluations, validating the effectiveness and robustness of our method.
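As a rough illustration of the dynamic generalization-guided reward the abstract describes, the sketch below anneals a rule-based reward from exploration-oriented terms (well-formed, actual tool calls) toward exploitation-oriented terms (task success). The linear schedule and the specific term weights are assumptions for illustration; the paper's exact formulation may differ.

```python
# Hypothetical sketch of a dynamic generalization-guided reward: early in
# training, reward exploratory tool-use patterns; later, shift weight toward
# exploitative, outcome-based rewards. Schedule and weights are assumptions.

def tool_use_reward(step: int, total_steps: int,
                    format_valid: bool, called_tool: bool,
                    answer_correct: bool) -> float:
    """Rule-based reward that anneals from exploration to exploitation."""
    # Linear annealing coefficient in [0, 1]; the paper's schedule is not
    # specified here, so a simple linear ramp stands in for it.
    alpha = min(step / max(total_steps, 1), 1.0)

    # Exploration term: encourage well-formed, actual tool invocations.
    explore = 0.0
    if format_valid:
        explore += 0.5
    if called_tool:
        explore += 0.5

    # Exploitation term: reward only task success (outcome-based).
    exploit = 1.0 if answer_correct else 0.0

    # Progressively shift from exploratory to exploitative patterns.
    return (1.0 - alpha) * explore + alpha * exploit
```

In an RL trainer (e.g., a PPO- or GRPO-style setup), a scalar rule like this would stand in for a learned reward model, which is what makes the design rule-based and tool-agnostic: no tool-specific supervision or reward model is required.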
Problem

Research questions and friction points this paper is trying to address.

Enhancing tool-augmented LLMs' generalization in unfamiliar scenarios
Developing pure RL methods without supervised fine-tuning dependency
Enabling autonomous tool usage through scalable reinforcement learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pure RL training without supervised fine-tuning
Dynamic generalization-guided reward design
Scaling up base models with tool-agnostic generalization
👥 Authors
Yirong Zeng, Harbin Institute of Technology SCIR Lab
Xiao Ding, Harbin Institute of Technology SCIR Lab
Yutai Hou, Huawei (LLM, NLP, Dialogue, Alignment, Meta Learning)
Yuxian Wang, Huawei Technologies Co., Ltd
Li Du, Beijing Academy of Artificial Intelligence
Juyi Dai, Harbin Institute of Technology SCIR Lab
Qiuyang Ding, Huawei Technologies Co., Ltd
Duyu Tang, Huawei (Natural Language Processing)
Dandan Tu, Huawei Technologies Co., Ltd
Weiwen Liu, Associate Professor, Shanghai Jiao Tong University (large language models, AI agents, recommender systems)
Bing Qin, Professor, Harbin Institute of Technology (Natural Language Processing, Information Extraction, Sentiment Analysis)
Ting Liu, Harbin Institute of Technology SCIR Lab