ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 23
✨ Influential: 3
🤖 AI Summary
Existing LLM tool-use evaluation benchmarks are limited to stateless API calls or offline trajectory analysis, failing to capture realistic multi-turn, state-dependent interactions. Method: We propose ToolSandbox, the first benchmark framework supporting stateful execution, multi-turn interaction, and online dialogue evaluation. It features (1) stateful tool execution with implicit state-dependency modeling; (2) a hybrid rule- and LLM-based user simulator enabling on-policy dynamic dialogue evaluation; and (3) a dynamic criterion mechanism assessing both intermediate and final milestones. Built upon finite-state machine modeling, RESTful sandbox encapsulation, and input normalization, it handles real-world challenges such as incomplete information. Results: Experiments reveal that current SOTA models achieve <40% accuracy on state-dependent tasks, exposing critical limitations in tool orchestration. ToolSandbox establishes a new paradigm and a rigorous, reproducible evaluation standard for LLM tool learning.
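The "stateful tool execution with implicit state-dependency modeling" described above can be illustrated with a minimal sketch. All names here (`world_state`, `set_wifi`, `search_contact`) are invented for illustration and are not ToolSandbox's actual API; the point is that one tool silently depends on world state mutated by another, so the model must discover and satisfy the dependency.

```python
# Hypothetical sketch of stateful tool execution with an implicit
# state dependency, in the spirit of ToolSandbox (invented names,
# not the benchmark's real API).

world_state = {"wifi_enabled": False, "contacts": {"Alice": "+1-555-0100"}}

def set_wifi(enabled: bool) -> str:
    """Tool that mutates shared world state."""
    world_state["wifi_enabled"] = enabled
    return f"wifi set to {enabled}"

def search_contact(name: str) -> str:
    """Tool with an implicit dependency: it silently requires wifi.

    The dependency is not stated in the tool's signature, so an LLM
    must infer from the error that set_wifi must be called first.
    """
    if not world_state["wifi_enabled"]:
        raise RuntimeError("No network connection")
    return world_state["contacts"].get(name, "not found")

try:
    search_contact("Alice")        # fails: implicit dependency unmet
except RuntimeError as err:
    print(err)                     # No network connection

set_wifi(True)                     # satisfy the dependency first
print(search_contact("Alice"))     # +1-555-0100
```

Because tools share mutable state, the same tool call can succeed or fail depending on trajectory history, which is exactly what stateless single-call benchmarks cannot capture.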

๐Ÿ“ Abstract
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused either on evaluating over stateless web services (RESTful APIs) based on a single-turn user prompt, or on an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM tool-use capabilities comprehensively
Addressing stateful tool execution and dependencies
Assessing performance gaps in complex tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateful tool execution with implicit dependencies
Built-in user simulator for conversational evaluation
Dynamic evaluation strategy for milestones
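The last innovation, milestone-based evaluation over an arbitrary trajectory, can be sketched as an ordered scan for predicate matches. This is a simplified illustration with invented names and event shapes; ToolSandbox's actual milestone criteria are richer (e.g. similarity-based matching).

```python
# Hypothetical sketch of milestone-based trajectory evaluation
# (invented names; not ToolSandbox's real evaluation code).

def evaluate(trajectory, milestones):
    """Score a trajectory by the fraction of milestones achieved.

    Milestones are predicates over single events and must be met in
    order, but unrelated events may occur between them.
    """
    idx = 0
    for event in trajectory:
        if idx < len(milestones) and milestones[idx](event):
            idx += 1
    return idx / len(milestones)

trajectory = [
    {"role": "tool", "name": "set_wifi", "args": {"enabled": True}},
    {"role": "tool", "name": "search_contact", "args": {"name": "Alice"}},
    {"role": "assistant", "content": "Alice's number is +1-555-0100."},
]
milestones = [
    # Intermediate milestone: wifi was turned on at some point.
    lambda e: e.get("name") == "set_wifi" and e["args"]["enabled"],
    # Final milestone: the correct number appears in a later reply.
    lambda e: "+1-555-0100" in e.get("content", ""),
]
print(evaluate(trajectory, milestones))  # 1.0
```

Scoring both intermediate and final milestones rewards partially correct trajectories, instead of the all-or-nothing exact-match scoring used by single-turn benchmarks.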
🔎 Similar Papers
2024-03-12 · Annual Meeting of the Association for Computational Linguistics · Citations: 16
Jiarui Lu
Apple
Thomas Holleis
Apple
Yizhe Zhang
Apple
Bernhard Aumayer
Apple
Feng Nan
Apple
Felix Bai
Apple
Shuang Ma
Apple AI/ML
Shen Ma
Apple
Mengyu Li
Apple
Guoli Yin
Engineer at Apple
Zirui Wang
Apple
Ruoming Pang
Apple AI/ML