🤖 AI Summary
Existing tool-learning benchmarks face two critical bottlenecks: manually curated online tools are limited in scale, while real-world API endpoints suffer from instability, undermining evaluation reliability and reproducibility. To address this, we introduce StableToolBench, a stable, large-scale benchmark platform for LLM tool learning. The method centers on (1) a virtual API server that combines a caching system with high-fidelity API simulators to mask the state volatility of live APIs; and (2) a GPT-4–based automatic evaluation framework that computes solvable pass rate and win rate while suppressing randomness in evaluation. Experiments demonstrate substantial improvements in evaluation stability and reproducibility. The platform enables rigorous, scalable, and verifiable tool-learning research, providing robust infrastructure for comparative, large-scale empirical studies.
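The cache-first, simulator-fallback design can be sketched as follows. This is a minimal illustration, not the actual StableToolBench implementation: the class and method names (`VirtualAPIServer`, `call`, the `simulate` callback) are hypothetical, and the real system's cache population and simulation prompting are abstracted away.

```python
import hashlib
import json


class VirtualAPIServer:
    """Sketch of a virtual API server: serve cached responses when
    available, otherwise fall back to a simulator instead of calling
    the live (possibly unstable) endpoint. Illustrative only."""

    def __init__(self, simulate):
        self.cache = {}           # request fingerprint -> cached response
        self.simulate = simulate  # fallback: simulate(api_name, args) -> response

    def _key(self, api_name, args):
        # Deterministic fingerprint of the request, so identical calls
        # always map to the same cache entry.
        payload = json.dumps({"api": api_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, api_name, args):
        key = self._key(api_name, args)
        if key not in self.cache:
            # Cache miss: produce a simulated response and store it,
            # so every later identical request sees the same result.
            self.cache[key] = self.simulate(api_name, args)
        return self.cache[key]
```

Because every request is fingerprinted deterministically and misses are filled once by the simulator, repeated identical calls always return the same response, which is the stability property the benchmark relies on.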
📝 Abstract
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the ability of LLMs to utilise tools requires large-scale and stable benchmarks. However, prior work has relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluation system.
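At a high level, the two metrics aggregate per-task judgements: solvable pass rate counts passes only over tasks judged solvable, and win rate counts pairwise wins against a reference model. A minimal sketch, assuming the GPT-4 judging step has already been reduced to booleans and win/tie/lose labels (the function names and input shapes here are illustrative, not the paper's API):

```python
def solvable_pass_rate(judgements):
    """judgements: list of (is_solvable, passed) booleans per task.
    Only tasks judged solvable count toward the denominator."""
    solvable = [passed for is_solvable, passed in judgements if is_solvable]
    return sum(solvable) / len(solvable) if solvable else 0.0


def win_rate(comparisons):
    """comparisons: list of 'win' / 'tie' / 'lose' outcomes of the
    candidate model versus a reference model on the same tasks."""
    if not comparisons:
        return 0.0
    wins = sum(1 for outcome in comparisons if outcome == "win")
    return wins / len(comparisons)
```

Restricting the pass-rate denominator to solvable tasks keeps unsolvable (e.g. permanently broken) APIs from dragging scores down for reasons unrelated to the model being evaluated.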