ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use

📅 2025-01-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing benchmarks do not specifically evaluate multi-hop tool use, which makes it hard to assess how well large language models (LLMs) chain tools whose calls depend on one another. ToolHop addresses this gap with a query-driven benchmark comprising 995 user queries and 3,912 associated, locally executable tools. The data is built in three stages: tool creation, document refinement, and code generation. The tools return detailed feedback, and the answers are verifiable. Experiments on 14 LLMs from five model families (LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT) show that even the top-performing model, GPT-4o, reaches only 49.04% accuracy. The model families also differ in their tool-use strategies.

📝 Abstract

Effective evaluation of multi-hop tool use is critical for analyzing the understanding, reasoning, and function-calling capabilities of large language models (LLMs). However, progress has been hindered by a lack of reliable evaluation datasets. To address this, we present ToolHop, a dataset comprising 995 user queries and 3,912 associated tools, specifically designed for rigorous evaluation of multi-hop tool use. ToolHop ensures diverse queries, meaningful interdependencies, locally executable tools, detailed feedback, and verifiable answers through a novel query-driven data construction approach that includes tool creation, document refinement, and code generation. We evaluate 14 LLMs across five model families (i.e., LLaMA3.1, Qwen2.5, Gemini1.5, Claude3.5, and GPT), uncovering significant challenges in handling multi-hop tool-use scenarios. The leading model, GPT-4o, achieves an accuracy of 49.04%, underscoring substantial room for improvement. Further analysis reveals that tool-use strategies vary across model families, offering actionable insights to guide the development of more effective approaches. Code and data are available at https://huggingface.co/bytedance-research/ToolHop.
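To make "multi-hop tool use" with "locally executable tools" and "verifiable answers" concrete, the sketch below chains two tools so that the second call consumes the first call's result, then scores predictions by exact match. It is a minimal illustration under stated assumptions, not the ToolHop implementation: the tools `get_birth_year` and `years_between`, the `$prev` placeholder, and the helpers `run_chain` and `accuracy` are all hypothetical names invented here.

```python
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical tools, invented for illustration; not taken from ToolHop.
def get_birth_year(person: str) -> int:
    """Look up a birth year in a tiny local table."""
    years = {"Ada Lovelace": 1815}
    if person not in years:
        # A descriptive error stands in for the "detailed feedback" idea.
        raise KeyError(f"unknown person: {person}")
    return years[person]


def years_between(start: int, end: int) -> int:
    """Return the number of years from start to end."""
    return end - start


TOOLS: Dict[str, Callable[..., Any]] = {
    "get_birth_year": get_birth_year,
    "years_between": years_between,
}


def run_chain(calls: List[Tuple[str, List[Any]]]) -> Any:
    """Run tool calls in order.

    The string "$prev" in an argument list is replaced by the previous
    call's result, which is what makes the chain multi-hop.
    """
    prev: Any = None
    for name, args in calls:
        resolved = [prev if a == "$prev" else a for a in args]
        prev = TOOLS[name](*resolved)
    return prev


def accuracy(predictions: List[Any], gold: List[Any]) -> float:
    """Exact-match accuracy in percent (assumed metric for this sketch)."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


# Two hops: look up a birth year, then compute an age at a given year.
answer = run_chain([
    ("get_birth_year", ["Ada Lovelace"]),
    ("years_between", ["$prev", 1852]),
])
```

Because every tool runs locally and the final answer is a plain value, a predicted answer can be checked against a reference without calling any external service.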

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Complex Task Evaluation

Multitool Integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

ToolHop Dataset

Complex Task Evaluation

Model Optimization Insights
