CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

📅 2025-06-11
🤖 AI Summary
Large language models (LLMs) frequently encounter diverse errors when invoking external tools for complex tasks, yet their capabilities in error identification, diagnosis, and recovery remain poorly understood and lack systematic evaluation. Method: We introduce the first benchmark dedicated to evaluating LLMs' self-critique capabilities on tool-calling errors. It features an evolutionary data synthesis strategy that generates realistic, diverse, multi-complexity error instances; integrates multi-dimensional error injection with fine-grained annotation of critical reasoning behaviors; and enables standardized, scalable assessment across three stages: identification, diagnosis, and recovery. Contribution/Results: Experiments reveal that even state-of-the-art models achieve less than 40% diagnostic accuracy, exposing critical weaknesses. The benchmark and evaluation framework are publicly released, establishing a new paradigm for studying reflective mechanisms in tool-augmented LLMs.
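To make the three-stage evaluation concrete, here is a minimal sketch of how a single critique might be scored across identification, diagnosis, and recovery. The data structures, error-type labels, and cascading scoring rule (diagnosis only counts if identification succeeded) are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of a CRITICTOOL-style three-stage critique score.
# Field names and the error taxonomy are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class ErrorInstance:
    trajectory: list     # tool-call history containing one injected error
    error_step: int      # index of the faulty call
    error_type: str      # e.g. "wrong_argument", "hallucinated_tool"
    gold_recovery: dict  # a corrected tool call

@dataclass
class Critique:
    flagged_step: int    # step the model claims is faulty
    claimed_type: str    # model's diagnosis of the error category
    proposed_call: dict  # model's repaired tool call

def score(instance: ErrorInstance, critique: Critique) -> dict:
    """Score one model critique across identification, diagnosis, recovery."""
    identified = critique.flagged_step == instance.error_step
    # Diagnosis counts only if the faulty step was identified first,
    # and recovery only if the diagnosis was also correct.
    diagnosed = identified and critique.claimed_type == instance.error_type
    recovered = diagnosed and critique.proposed_call == instance.gold_recovery
    return {"identify": identified, "diagnose": diagnosed, "recover": recovered}
```

Under this cascading rule, a model cannot be credited with recovery for a lucky fix whose diagnosis was wrong, which matches the paper's framing of the three stages as sequential.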

📝 Abstract
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL contains diverse tool-use errors with varying complexities, better reflecting real-world scenarios. We conduct extensive experiments on CRITICTOOL and validate the generalization and effectiveness of our benchmark construction strategy. We also provide an in-depth analysis of the tool reflection ability of various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' self-critique in tool-calling error scenarios
Analyzing error types in function-calling processes
Assessing tool reflection ability across various LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evolutionary strategy for diverse dataset construction
Comprehensive critique evaluation benchmark CRITICTOOL
In-depth analysis of tool reflection ability
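The evolutionary dataset-construction idea can be sketched as repeatedly mutating a clean seed tool call with error-injection operators, so each generation yields more varied faulty instances. The mutation operators, JSON-style call format, and breadth-first expansion below are assumptions for illustration, not the paper's actual strategy.

```python
# Illustrative sketch of evolutionary error injection for tool calls.
# Operators and the tool-call dict format are hypothetical.
import copy
import random

def inject_wrong_argument(call):
    """Corrupt one argument value of a tool call."""
    bad = copy.deepcopy(call)
    if bad["arguments"]:
        key = random.choice(list(bad["arguments"]))
        bad["arguments"][key] = "???"
    return bad, "wrong_argument"

def inject_hallucinated_tool(call):
    """Rename the tool to one that does not exist."""
    bad = copy.deepcopy(call)
    bad["name"] = bad["name"] + "_v2"
    return bad, "hallucinated_tool"

def evolve(seed_call, operators, generations=3):
    """Breadth-first expansion: each generation applies every operator
    to every instance produced by the previous generation."""
    population = [(seed_call, "clean")]
    for _ in range(generations):
        population = [op(call) for call, _ in population for op in operators]
    return population

seed = {"name": "get_weather", "arguments": {"city": "Paris"}}
errors = evolve(seed, [inject_wrong_argument, inject_hallucinated_tool],
                generations=2)
# 2 operators over 2 generations -> 4 labeled faulty variants
```

Compounding operators across generations is one plausible way to obtain the "varying complexities" the paper describes: later generations carry multiple stacked faults rather than a single isolated one.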
👥 Authors
Shiting Huang
Zhen Fang (University of Science and Technology of China; Communication University of China)
Zehui Chen (University of Science and Technology of China)
Siyu Yuan (Fudan University)
Junjie Ye (Fudan University)
Yu Zeng (University of Science and Technology of China)
Lin Chen (University of Science and Technology of China)
Qi Mao (Communication University of China)
Feng Zhao (University of Science and Technology of China)