GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the generalization bottleneck of large language models (LLMs) in tool usage, formally modeling two critical generalization capabilities: Zero-to-One (handling unseen queries and novel tools) and Weak-to-Strong (solving tasks that require powerful tools when only weaker tool variants were seen in training). To this end, the authors propose a two-stage supervised fine-tuning paradigm, first optimizing tool ranking and then refining tool selection, enabled by controllably synthesized data to achieve zero-shot tool adaptation and progressive capability advancement. Experiments across four generalization scenarios show that the method significantly outperforms mainstream baselines; notably, fine-tuned 1B–8B parameter models surpass GPT-4o across all evaluated metrics. The paper also provides the first empirical characterization of structural bottlenecks underlying LLM tool generalization, revealing fundamental limitations in cross-tool and cross-query transfer.

📝 Abstract
Large Language Models (LLMs) can enhance their capabilities as AI assistants by integrating external tools, allowing them to access a wider range of information. While recent LLMs are typically fine-tuned with tool usage examples during supervised fine-tuning (SFT), questions remain about their ability to develop robust tool-usage skills and to generalize effectively to unseen queries and tools. In this work, we present GenTool, a novel training framework that prepares LLMs for diverse generalization challenges in tool utilization. Our approach addresses two fundamental dimensions critical for real-world applications: Zero-to-One Generalization, enabling the model to address queries initially lacking a suitable tool by adopting and utilizing one when it becomes available, and Weak-to-Strong Generalization, allowing models to leverage enhanced versions of existing tools to solve queries. To achieve this, we develop synthetic training data simulating these two dimensions of tool usage and introduce a two-stage fine-tuning approach: optimizing tool ranking, then refining tool selection. Through extensive experiments across four generalization scenarios, we demonstrate that our method significantly enhances the tool-usage capabilities of LLMs ranging from 1B to 8B parameters, achieving performance that surpasses GPT-4o. Furthermore, our analysis provides valuable insights into the challenges LLMs encounter in tool generalization.
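The two-stage recipe (optimize tool ranking, then refine tool selection) can be illustrated with a toy target-construction step. This is a minimal sketch, not the paper's implementation: the names `Tool` and `make_stage_targets`, and the use of a numeric relevance score, are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    relevance: float  # assumed relevance of this tool to the query (illustrative)

def make_stage_targets(query: str, tools: list[Tool]):
    """Build SFT targets for a two-stage scheme like the one described:
    stage 1 trains the model to rank candidate tools for the query,
    stage 2 trains it to commit to the single best tool (or abstain)."""
    ranked = sorted(tools, key=lambda t: t.relevance, reverse=True)
    stage1_target = [t.name for t in ranked]            # tool-ranking target
    best = ranked[0] if ranked and ranked[0].relevance > 0 else None
    stage2_target = best.name if best else "no_tool"    # tool-selection target
    return stage1_target, stage2_target

tools = [
    Tool("weather_basic", "current weather only", 0.4),
    Tool("weather_pro", "forecast + history", 0.9),
    Tool("calculator", "arithmetic", 0.0),
]
print(make_stage_targets("7-day forecast for Paris?", tools))
# → (['weather_pro', 'weather_basic', 'calculator'], 'weather_pro')
```

Separating the two targets lets the ranking stage teach coarse tool discrimination before the selection stage forces a single commitment, mirroring the coarse-to-fine split the summary describes.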
Problem

Research questions and friction points this paper is trying to address.

Enhance tool generalization in LLMs
Address unseen queries and new tools
Improve tool ranking and selection accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-to-One Generalization
Weak-to-Strong Generalization
Synthetic training data simulation
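The synthetic data simulation above can be sketched as follows. All names here (`ToolSpec`, `weak_variant`, `make_examples`) and the capability-set representation are illustrative assumptions, not the paper's actual data pipeline.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    capabilities: frozenset  # what the tool can do (illustrative encoding)

def weak_variant(strong: ToolSpec, drop: set) -> ToolSpec:
    """Weak-to-Strong simulation: derive a weaker tool by removing
    capabilities, so training can pair a seen weak tool with an
    unseen strong variant of the same tool."""
    return ToolSpec(strong.name + "_lite", strong.capabilities - frozenset(drop))

def make_examples(query: str, strong: ToolSpec, needed: set):
    """Generate toy examples for both generalization dimensions."""
    weak = weak_variant(strong, needed)  # weak tool cannot satisfy the query
    return [
        # Zero-to-One: no suitable tool at first, then the tool appears
        {"query": query, "tools": [], "label": "no_tool"},
        {"query": query, "tools": [strong.name], "label": strong.name},
        # Weak-to-Strong: prefer the strong variant when both are offered
        {"query": query, "tools": [weak.name, strong.name], "label": strong.name},
    ]

strong = ToolSpec("flight_search", frozenset({"search", "book"}))
for ex in make_examples("Book me a flight to Tokyo", strong, {"book"}):
    print(ex)
```

The point of the controllable construction is that each query appears under several tool inventories (none, weak-only, weak+strong), so the model is explicitly supervised on when to abstain, adopt, or upgrade.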