ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 23
✨ Influential: 3

🤖 AI Summary
Existing LLM tool-use evaluation benchmarks are limited to stateless API calls or offline trajectory analysis, failing to capture realistic multi-turn, state-dependent interactions. Method: We propose ToolSandbox, the first benchmark framework supporting stateful execution, multi-turn interaction, and online dialogue evaluation. It features (1) stateful tool execution with implicit state-dependency modeling; (2) a hybrid rule- and LLM-based user simulator enabling on-policy dynamic dialogue evaluation; and (3) a dynamic criterion mechanism assessing both intermediate and final milestones. Built upon finite-state machine modeling, RESTful sandbox encapsulation, and input normalization, it handles real-world challenges such as incomplete information. Results: Experiments reveal that current SOTA models achieve <40% accuracy on state-dependent tasks, exposing critical limitations in tool orchestration. ToolSandbox establishes a new paradigm and a rigorous, reproducible evaluation standard for LLM tool learning.

📝 Abstract
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous work focused on evaluating over stateless web services (RESTful APIs), based on a single-turn user prompt or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information, as defined in ToolSandbox, challenge even the most capable SOTA LLMs, providing new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
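
To make "stateful tool execution" and "implicit state dependencies between tools" concrete, here is a minimal Python sketch. It is not the ToolSandbox API: the names WorldState, set_cellular, send_message, and run_trajectory are hypothetical and chosen only for illustration. The idea it shows is that one tool can silently require a state that only another tool can set, so the order of calls matters.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Mutable state shared by all tools (hypothetical, for illustration)."""
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)


class ToolError(Exception):
    """Raised when a tool's precondition on the world state is not met."""


def set_cellular(state: WorldState, enabled: bool) -> str:
    # Mutates the shared state; later tool calls observe the change.
    state.cellular_enabled = enabled
    return f"cellular_enabled={enabled}"


def send_message(state: WorldState, recipient: str, text: str) -> str:
    # Implicit dependency: the signature does not reveal that
    # cellular service must already be on.
    if not state.cellular_enabled:
        raise ToolError("cellular service is off")
    state.sent_messages.append((recipient, text))
    return "sent"


def run_trajectory(state, calls):
    """Execute (tool, kwargs) pairs in order, logging results or errors."""
    log = []
    for tool, kwargs in calls:
        try:
            log.append((tool.__name__, tool(state, **kwargs)))
        except ToolError as exc:
            log.append((tool.__name__, f"error: {exc}"))
    return log
```

In this sketch, calling send_message before set_cellular fails, while the same call succeeds afterwards. A stateless benchmark cannot express that difference, because each call would be judged in isolation.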
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM tool-use capabilities comprehensively
Addressing stateful tool execution and dependencies
Assessing performance gaps in complex tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateful tool execution with implicit dependencies
Built-in user simulator for conversational evaluation
Dynamic evaluation strategy for milestones
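
As a rough illustration of evaluating intermediate and final milestones over an arbitrary trajectory, the Python sketch below scores a sequence of state snapshots against an ordered list of milestone predicates. This is a simplification under stated assumptions, not the paper's scoring method: the function name milestone_score, the strict linear ordering of milestones, and the binary (matched or not) scoring are all hypothetical choices made for brevity.

```python
from typing import Callable, Dict, List

Snapshot = Dict[str, object]            # world state after one step (assumed shape)
Milestone = Callable[[Snapshot], bool]  # predicate that a key event has occurred


def milestone_score(trajectory: List[Snapshot],
                    milestones: List[Milestone]) -> float:
    """Fraction of milestones satisfied, in order, along the trajectory.

    Assumes milestones form a simple linear sequence; a snapshot may
    satisfy several consecutive milestones at once.
    """
    if not milestones:
        return 1.0
    matched = 0
    for snapshot in trajectory:
        # Advance past every milestone this snapshot satisfies.
        while matched < len(milestones) and milestones[matched](snapshot):
            matched += 1
    return matched / len(milestones)


# Hypothetical milestones: cellular turned on, then a message sent.
milestones = [
    lambda s: s["cellular_enabled"] is True,
    lambda s: s["messages_sent"] >= 1,
]

complete = [
    {"cellular_enabled": False, "messages_sent": 0},
    {"cellular_enabled": True, "messages_sent": 0},
    {"cellular_enabled": True, "messages_sent": 1},
]
partial = complete[:2]  # the agent stopped before sending the message
```

Scoring intermediate milestones gives partial credit to a trajectory that reaches some key states but not the final one, and it does not require the agent to follow one fixed reference sequence of tool calls.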