ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

📅 2024-08-08
🏛️ arXiv.org
📈 Citations: 23
✨ Influential: 3
🤖 AI Summary
Existing LLM tool-use evaluation benchmarks are limited to stateless API calls or offline trajectory analysis, failing to capture realistic multi-turn, state-dependent interactions. Method: We propose ToolSandbox, the first benchmark framework supporting stateful execution, multi-turn interaction, and online dialogue evaluation. It features (1) stateful tool execution with implicit state-dependency modeling; (2) a hybrid rule- and LLM-based user simulator enabling on-policy dynamic dialogue evaluation; and (3) a dynamic criterion mechanism assessing both intermediate and final milestones. Built upon finite-state machine modeling, RESTful sandbox encapsulation, and input normalization, it handles real-world challenges such as incomplete information. Results: Experiments reveal that current SOTA models achieve <40% accuracy on state-dependent tasks, exposing critical limitations in tool orchestration. ToolSandbox establishes a new paradigm and a rigorous, reproducible evaluation standard for LLM tool learning.
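The "stateful tool execution with implicit state-dependency modeling" described above can be illustrated with a minimal sketch. All names here (`world_state`, `set_wifi`, `search_contact`) are invented for illustration and are not ToolSandbox's actual API; the point is that one tool silently depends on world state mutated by another, so the model must discover and satisfy the dependency.

```python
# Hypothetical sketch of stateful tool execution with an implicit
# state dependency, in the spirit of ToolSandbox (invented names,
# not the benchmark's real API).

world_state = {"wifi_enabled": False, "contacts": {"Alice": "+1-555-0100"}}

def set_wifi(enabled: bool) -> str:
    """Tool that mutates shared world state."""
    world_state["wifi_enabled"] = enabled
    return f"wifi set to {enabled}"

def search_contact(name: str) -> str:
    """Tool with an implicit dependency: it silently requires wifi.

    The dependency is not stated in the tool's signature, so an LLM
    must infer from the error that set_wifi must be called first.
    """
    if not world_state["wifi_enabled"]:
        raise RuntimeError("No network connection")
    return world_state["contacts"].get(name, "not found")

try:
    search_contact("Alice")        # fails: implicit dependency unmet
except RuntimeError as err:
    print(err)                     # No network connection

set_wifi(True)                     # satisfy the dependency first
print(search_contact("Alice"))     # +1-555-0100
```

Because tools share mutable state, the same tool call can succeed or fail depending on trajectory history, which is exactly what stateless single-call benchmarks cannot capture.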

๐Ÿ“ Abstract
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused either on evaluating over stateless web services (RESTful APIs) based on a single-turn user prompt, or on an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization, and Insufficient Information defined in ToolSandbox challenge even the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM tool-use capabilities comprehensively
Addressing stateful tool execution and dependencies
Assessing performance gaps in complex tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stateful tool execution with implicit dependencies
Built-in user simulator for conversational evaluation
Dynamic evaluation strategy for milestones
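The last innovation, milestone-based evaluation over an arbitrary trajectory, can be sketched as an ordered scan for predicate matches. This is a simplified illustration with invented names and event shapes; ToolSandbox's actual milestone criteria are richer (e.g. similarity-based matching).

```python
# Hypothetical sketch of milestone-based trajectory evaluation
# (invented names; not ToolSandbox's real evaluation code).

def evaluate(trajectory, milestones):
    """Score a trajectory by the fraction of milestones achieved.

    Milestones are predicates over single events and must be met in
    order, but unrelated events may occur between them.
    """
    idx = 0
    for event in trajectory:
        if idx < len(milestones) and milestones[idx](event):
            idx += 1
    return idx / len(milestones)

trajectory = [
    {"role": "tool", "name": "set_wifi", "args": {"enabled": True}},
    {"role": "tool", "name": "search_contact", "args": {"name": "Alice"}},
    {"role": "assistant", "content": "Alice's number is +1-555-0100."},
]
milestones = [
    # Intermediate milestone: wifi was turned on at some point.
    lambda e: e.get("name") == "set_wifi" and e["args"]["enabled"],
    # Final milestone: the correct number appears in a later reply.
    lambda e: "+1-555-0100" in e.get("content", ""),
]
print(evaluate(trajectory, milestones))  # 1.0
```

Scoring both intermediate and final milestones rewards partially correct trajectories, instead of the all-or-nothing exact-match scoring used by single-turn benchmarks.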
🔎 Similar Papers
2024-03-12 · Annual Meeting of the Association for Computational Linguistics · Citations: 16
Jiarui Lu
Apple
Thomas Holleis
Apple
Yizhe Zhang
Apple
Bernhard Aumayer
Apple
Feng Nan
Apple
Felix Bai
Apple
Shuang Ma
Apple AI/ML
Shen Ma
Apple
Mengyu Li
Apple
Guoli Yin
Engineer at Apple
Zirui Wang
Apple
Ruoming Pang
Apple AI/ML