🤖 AI Summary
Existing tool-learning benchmarks face two critical bottlenecks: manually curated online tools are limited in scale, while real-world API endpoints suffer from instability, undermining evaluation reliability and reproducibility. To address this, we introduce StableToolBench, a stable, large-scale benchmark platform for LLM tool learning. The method centers on (1) a virtual API server that combines a caching system with high-fidelity API simulators to mask the state volatility of live APIs; and (2) a GPT-4–based automatic evaluation framework that computes solvable pass rate and win rate while suppressing randomness in evaluation. Experiments demonstrate substantial improvements in evaluation stability and reproducibility. The platform enables rigorous, scalable, and verifiable tool-learning research, providing robust infrastructure for comparative, large-scale empirical studies.
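The cache-first, simulator-fallback design can be sketched as follows. This is a minimal illustration, not the actual StableToolBench implementation: the class and method names (`VirtualAPIServer`, `call`, the `simulate` callback) are hypothetical, and the real system's cache population and simulation prompting are abstracted away.

```python
import hashlib
import json


class VirtualAPIServer:
    """Sketch of a virtual API server: serve cached responses when
    available, otherwise fall back to a simulator instead of calling
    the live (possibly unstable) endpoint. Illustrative only."""

    def __init__(self, simulate):
        self.cache = {}           # request fingerprint -> cached response
        self.simulate = simulate  # fallback: simulate(api_name, args) -> response

    def _key(self, api_name, args):
        # Deterministic fingerprint of the request, so identical calls
        # always map to the same cache entry.
        payload = json.dumps({"api": api_name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, api_name, args):
        key = self._key(api_name, args)
        if key not in self.cache:
            # Cache miss: produce a simulated response and store it,
            # so every later identical request sees the same result.
            self.cache[key] = self.simulate(api_name, args)
        return self.cache[key]
```

Because every request is fingerprinted deterministically and misses are filled once by the simulator, repeated identical calls always return the same response, which is the stability property the benchmark relies on.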
📝 Abstract
Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the ability of LLMs to utilise tools requires large-scale and stable benchmarks. However, prior work has relied either on hand-crafted online tools of limited scale or on large-scale real online APIs that suffer from unstable API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators, which complement each other to mitigate changes in API status. Meanwhile, the stable evaluation system defines solvable pass and win rates, using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluation system.
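At a high level, the two metrics aggregate per-task judgements: solvable pass rate counts passes only over tasks judged solvable, and win rate counts pairwise wins against a reference model. A minimal sketch, assuming the GPT-4 judging step has already been reduced to booleans and win/tie/lose labels (the function names and input shapes here are illustrative, not the paper's API):

```python
def solvable_pass_rate(judgements):
    """judgements: list of (is_solvable, passed) booleans per task.
    Only tasks judged solvable count toward the denominator."""
    solvable = [passed for is_solvable, passed in judgements if is_solvable]
    return sum(solvable) / len(solvable) if solvable else 0.0


def win_rate(comparisons):
    """comparisons: list of 'win' / 'tie' / 'lose' outcomes of the
    candidate model versus a reference model on the same tasks."""
    if not comparisons:
        return 0.0
    wins = sum(1 for outcome in comparisons if outcome == "win")
    return wins / len(comparisons)
```

Restricting the pass-rate denominator to solvable tasks keeps unsolvable (e.g. permanently broken) APIs from dragging scores down for reasons unrelated to the model being evaluated.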