AI Summary
Existing LLM tool-use evaluation benchmarks are limited to stateless API calls or offline trajectory analysis, failing to capture realistic multi-turn, state-dependent interactions. Method: We propose ToolSandbox, the first benchmark framework supporting stateful execution, multi-turn interaction, and online dialogue evaluation. It features (1) stateful tool execution with implicit state-dependency modeling; (2) a hybrid rule- and LLM-based user simulator enabling on-policy dynamic dialogue evaluation; and (3) a dynamic criterion mechanism assessing both intermediate and final milestones. Built upon finite-state machine modeling, RESTful sandbox encapsulation, and input normalization, it handles real-world challenges such as incomplete information. Results: Experiments reveal that current SOTA models achieve <40% accuracy on state-dependent tasks, exposing critical limitations in tool orchestration. ToolSandbox establishes a new paradigm and a rigorous, reproducible evaluation standard for LLM tool learning.
Abstract
Recent advancements in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating either over stateless web services (RESTful APIs) based on a single-turn user prompt, or over an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing brand-new insights into tool-use LLM capabilities. The ToolSandbox evaluation framework is released at https://github.com/apple/ToolSandbox
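To make "stateful tool execution", "implicit state dependency", and milestone-based evaluation concrete, here is a minimal Python sketch. It is an illustration only: every name in it (`WorldState`, `set_cellular`, `send_message`, `milestones_reached`, `run`) is hypothetical and not taken from the ToolSandbox codebase, and the in-order milestone check is a simplifying assumption rather than the framework's actual evaluation strategy.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Mutable state shared by all tools (hypothetical, not ToolSandbox's schema)."""
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular(state: WorldState, enabled: bool) -> str:
    """Tool that mutates the shared state."""
    state.cellular_enabled = enabled
    return f"cellular service set to {enabled}"


def send_message(state: WorldState, to: str, text: str) -> str:
    """Tool with an implicit state dependency: it only works once
    another tool has enabled cellular service."""
    if not state.cellular_enabled:
        raise RuntimeError("cellular service is off")
    state.sent_messages.append((to, text))
    return "message sent"


def run(calls):
    """Execute (tool, kwargs) pairs against one shared state and
    record the name of each successful call or the error message."""
    state = WorldState()
    trajectory = []
    for tool, kwargs in calls:
        try:
            tool(state, **kwargs)
            trajectory.append(tool.__name__)
        except RuntimeError as err:
            trajectory.append(f"error: {err}")
    return state, trajectory


def milestones_reached(trajectory, milestones):
    """Assumed criterion: milestones must appear in order in the
    trajectory, with any number of other steps in between."""
    steps = iter(trajectory)
    return all(any(m == step for step in steps) for m in milestones)


# An agent that first fails, then recovers by enabling cellular service.
state, trajectory = run([
    (send_message, {"to": "alice", "text": "hi"}),
    (set_cellular, {"enabled": True}),
    (send_message, {"to": "alice", "text": "hi"}),
])
print(trajectory)
print(milestones_reached(trajectory, ["set_cellular", "send_message"]))
```

In this sketch the first `send_message` call fails because the state it depends on has not been set. The trajectory still satisfies the milestones, since the check tolerates extra or failed steps between them, loosely mirroring evaluation over an arbitrary trajectory.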