SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

📅 2024-11-20
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing tool-use benchmarks report only aggregate success rates, offering no fine-grained insight into how LLMs fail during tool invocation. Method: We introduce a reproducible, attributable diagnostic benchmark for tool use. It defines and empirically validates seven interpretable error categories (including malformed parameter formatting, incorrect tool selection, and context forgetting), grounded in multi-source, real-world tool-interaction scenarios. The test suite combines human annotation with pattern-driven analysis to enable automated error-type identification and statistical attribution. Contribution/Results: Experiments show that all seven error types occur pervasively across mainstream LLMs. The benchmark is publicly released, enabling precise root-cause analysis of errors and targeted model improvement, and advancing the interpretability, debuggability, and robustness of LLM-based tool-use systems.
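To make the pattern-driven identification step concrete, the sketch below shows a minimal rule-based checker that compares a single tool call against a tool specification and assigns one error category. It is illustrative only: the `ToolCall` structure, `TOOL_SPECS` format, and `classify_error` helper are hypothetical, and only a subset of the seven categories named in the summary is covered; SpecTool's actual detectors are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical representation of one LLM tool call; SpecTool's actual
# data format may differ.
@dataclass
class ToolCall:
    name: str
    arguments: dict

# Toy tool specification: tool name -> {parameter name: expected type}.
TOOL_SPECS = {
    "get_weather": {"city": str, "units": str},
}

def classify_error(call: ToolCall, expected_tool: str) -> str:
    """Assign one illustrative error category to a single tool call.

    Covers only call-local categories; others named in the summary
    (e.g. context forgetting) require the full conversation history,
    not a single call.
    """
    if call.name != expected_tool:
        return "incorrect_tool_selection"
    spec = TOOL_SPECS.get(call.name, {})
    for param, expected_type in spec.items():
        if param not in call.arguments:
            return "missing_parameter"
        if not isinstance(call.arguments[param], expected_type):
            return "malformed_parameter_formatting"
    return "no_error"

# Example: a wrong argument type triggers the formatting category.
call = ToolCall("get_weather", {"city": "Paris", "units": 3})
print(classify_error(call, "get_weather"))  # malformed_parameter_formatting
```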

📝 Abstract
Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
Problem

Research questions and friction points this paper is trying to address.

How to identify error patterns in LLM tool-use outputs
Existing benchmarks lack detailed failure analysis
Which recurring error types to characterize to guide mitigation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the SpecTool benchmark for diagnosing LLM tool-use errors
Characterizes seven new error patterns
Tests queries from diverse environments for error detection (see the sketch below)
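Running such a checker over a query set then yields per-category error counts rather than a single aggregate success rate, which is the kind of breakdown the benchmark advocates. A minimal aggregation sketch, reusing the hypothetical `ToolCall` and `classify_error` from the sketch above:

```python
from collections import Counter

# Hypothetical per-query records: (model's tool call, expected tool).
# Reuses the illustrative ToolCall / classify_error defined earlier.
results = [
    (ToolCall("get_weather", {"city": "Paris", "units": "metric"}), "get_weather"),
    (ToolCall("search_web", {"q": "Paris weather"}), "get_weather"),
    (ToolCall("get_weather", {"city": "Paris"}), "get_weather"),
]

# Per-category tallies give a fine-grained failure breakdown
# instead of one aggregate success rate.
counts = Counter(classify_error(call, expected) for call, expected in results)
print(dict(counts))
# e.g. {'no_error': 1, 'incorrect_tool_selection': 1, 'missing_parameter': 1}
```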
Authors

Shirley Kokane (Salesforce Research)
Ming Zhu (Salesforce AI Research, USA)
T. Awalgaonkar (Salesforce AI Research, USA)
Jianguo Zhang (Salesforce AI Research, USA)
Thai Hoang (Salesforce AI Research, USA)
Akshara Prabhakar (Salesforce AI Research, USA)
Zuxin Liu (Salesforce AI Research)
Tian Lan (Salesforce AI Research, USA)
Liangwei Yang (Salesforce Research)
Juntao Tan (Research Scientist, Salesforce)
Rithesh Murthy (Salesforce AI Research, USA)
Weiran Yao (Salesforce AI Research, USA)
Zhiwei Liu (Salesforce AI Research, USA)
Juan Carlos Niebles (Research Director, Salesforce; Adjunct Professor, Stanford University)
Huan Wang (Salesforce AI Research, USA)
Shelby Heinecke (Salesforce Research)
Caiming Xiong (Salesforce Research)
Silvio Savarese (Salesforce AI Research, USA)