StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

📅 2024-03-12
🏛️ Annual Meeting of the Association for Computational Linguistics
📈 Citations: 16
Influential: 2
🤖 AI Summary
Existing tool-learning benchmarks face two critical bottlenecks: manually curated online tools are limited in scale, while real-world API endpoints suffer from instability, undermining evaluation reliability and reproducibility. To address this, we introduce the first stable, large-scale benchmark platform for LLM tool learning. Our method centers on (1) a virtual API server integrating dynamic caching and high-fidelity API simulation to eliminate state volatility inherent in live APIs; and (2) a deterministic, GPT-4–based automatic evaluation framework that computes solvable pass rate and win rate without stochasticity. Experiments demonstrate substantial improvements in evaluation stability and reproducibility. The platform enables rigorous, scalable, and verifiable tool-learning research—providing a robust infrastructure for comparative, large-scale empirical studies.
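The summary above describes a virtual API server that first serves cached responses and only falls back to an LLM-backed simulator when the live API is unavailable. A minimal sketch of that cache-then-simulate flow is below; all names (`CACHE`, `call_real_api`, `simulate_api`, `virtual_api_call`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the cache-then-simulate flow described above.
CACHE = {}  # maps (api_name, frozen args) -> cached response

def call_real_api(api_name, args):
    # Placeholder for a live API call; real endpoints may be offline or changed.
    raise ConnectionError("API unavailable")

def simulate_api(api_name, args):
    # Placeholder for an LLM-backed simulator producing a plausible response.
    return {"api": api_name, "simulated": True}

def virtual_api_call(api_name, **args):
    key = (api_name, frozenset(args.items()))
    if key in CACHE:                                # 1) serve stable cached response
        return CACHE[key]
    try:
        response = call_real_api(api_name, args)    # 2) try the live API
    except Exception:
        response = simulate_api(api_name, args)     # 3) fall back to simulation
    CACHE[key] = response                           # cache for reproducibility
    return response
```

Once a response is cached, repeated benchmark runs return the identical payload, which is what removes the run-to-run variance caused by unstable live APIs.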

📝 Abstract
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system.
Problem

Research questions and friction points this paper is trying to address.

Develop stable large-scale benchmarks for tool learning in LLMs.
Address instability in real online APIs for tool learning evaluation.
Introduce virtual API server and stable evaluation system for LLMs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Virtual API server with caching system
API simulators for stable tool learning
GPT-4 based automatic evaluation system
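The two metrics named above can be sketched as simple aggregations over judge verdicts: solvable pass rate restricts scoring to tasks the judge deems solvable, and win rate counts pairwise wins against a reference model. The function and variable names below are assumptions for illustration, not the paper's implementation.

```python
# Illustrative computation of solvable pass rate and win rate (names are assumed).

def solvable_pass_rate(results, solvable):
    """results: task_id -> bool (passed, per the automatic judge);
    solvable: set of task_ids the judge considers solvable at all."""
    judged = [results[t] for t in solvable if t in results]
    return sum(judged) / len(judged) if judged else 0.0

def win_rate(comparisons):
    """comparisons: list of 'win'/'lose'/'tie' verdicts from pairwise judging."""
    return comparisons.count("win") / len(comparisons) if comparisons else 0.0
```

Filtering to solvable tasks keeps a model from being penalized for queries that no agent could complete (e.g. because the underlying tool never worked).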
Zhicheng Guo
Department of Computer Science and Technology, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University
Sijie Cheng
Department of Computer Science and Technology, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University; 01.AI
Hao Wang
The University of Hong Kong
Shihao Liang
ByteDance
Multimodal Agent, Agent Evaluation
Yujia Qin
ByteDance
Agent
Peng Li
Institute for AI Industry Research (AIR), Tsinghua University
Zhiyuan Liu
Department of Computer Science and Technology, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University
Maosong Sun
Professor of Computer Science and Technology, Tsinghua University
Natural Language Processing, Artificial Intelligence, Social Computing
Yang Liu
Department of Computer Science and Technology, Tsinghua University; Institute for AI Industry Research (AIR), Tsinghua University