ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

📅 2025-01-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks lack dedicated evaluation of multi-hop tool invocation capabilities, hindering systematic assessment of large language models’ (LLMs) ability to integrate tools under complex inter-tool dependencies and extended reasoning chains. To address this gap, we introduce ToolHop—the first query-driven benchmark for multi-hop tool calling—comprising 995 real-world user queries and 3,912 locally executable tools, covering deep dependency hierarchies and diverse invocation strategies. We propose a novel three-stage data construction paradigm: tool generation, documentation refinement, and code synthesis, grounded in real-world APIs. Evaluation leverages programmatic feedback, dependency graph constraints, and multi-model collaborative assessment (LLaMA-3.1, Qwen2.5, Gemini-1.5, Claude-3.5, GPT-4o). Experiments across 14 state-of-the-art LLMs reveal that even the top-performing model (GPT-4o) achieves only 49.04% accuracy, exposing a critical bottleneck. Furthermore, distinct model families exhibit systematic differences in strategic reasoning patterns.
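To make the notion of "multi-hop tool calling with inter-tool dependencies" concrete, here is a minimal illustrative sketch (not taken from the ToolHop dataset; all tool names and data are hypothetical). The model must decompose the query, notice that the second tool's input depends on the first tool's output, and chain locally executable calls to reach a verifiable answer:

```python
# Hypothetical two-hop query: "What is the birth year of the author of
# 'The Old Man and the Sea'?" Each function stands in for one locally
# executable tool in a ToolHop-style setup.

def get_author(book_title: str) -> str:
    """Hop 1: look up the author of a book (hypothetical local tool)."""
    catalog = {"The Old Man and the Sea": "Ernest Hemingway"}
    return catalog[book_title]

def get_birth_year(person: str) -> int:
    """Hop 2: look up a person's birth year; its input is hop 1's output."""
    records = {"Ernest Hemingway": 1899}
    return records[person]

def solve(book_title: str) -> int:
    # The dependency chain: hop 2 cannot be called until hop 1 resolves.
    author = get_author(book_title)
    return get_birth_year(author)

print(solve("The Old Man and the Sea"))  # 1899
```

Because the tools execute locally and the final answer is checkable against a gold label, accuracy can be scored programmatically rather than by human judgment, which is the evaluation setting the benchmark targets.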

📝 Abstract
Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of only 49.04%, underscoring substantial room for improvement. Further analysis reveals variations in tool-use strategies across model families, offering actionable insights to guide the development of more effective approaches. Code and data can be found at https://huggingface.co/bytedance-research/ToolHop.
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Complex Task Evaluation
Multi-tool Integration
Innovation

Methods, ideas, or system contributions that make the work stand out.

ToolHop Dataset
Complex Task Evaluation
Model Optimization Insights
Junjie Ye
School of Computer Science, Fudan University; ByteDance
Zhengyin Du
ByteDance Seed
Large Language Model · Multi-modal Learning
Xuesong Yao
Master of Mechanics, Peking University
Machine Learning · Large Language Model
Weijian Lin
Carnegie Mellon University
Machine Learning · Recommendation Systems · Large Language Model
Yufei Xu
ByteDance
Zehui Chen
USTC
Zaiyuan Wang
ByteDance
AI · LLM · Function Call · Agent
Sining Zhu
ByteDance
Zhiheng Xi
Fudan University
LLM Reasoning · LLM-based Agents
Siyu Yuan
School of Data Science, Fudan University
Tao Gui
Institute of Modern Languages and Linguistics, Fudan University
Qi Zhang
School of Computer Science, Fudan University
Xuanjing Huang
School of Computer Science, Fudan University
Jiechao Chen
ByteDance