Beyond Accuracy: A Cognitive Load Framework for Mapping the Capability Boundaries of Tool-use Agents

📅 2026-01-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of tool-using agents predominantly rely on final accuracy, which fails to uncover their cognitive bottlenecks and capability boundaries. This work introduces cognitive load theory into the assessment of agent tool use, decomposing task complexity into intrinsic load, modeled via tool interaction graphs, and extraneous load, arising from ambiguity in the task description. The authors propose ToolLoad-Bench, a novel benchmark that enables parametric control over cognitive load, allowing diagnostic profiling of agent performance under varying load conditions. Empirical results reveal a "cliff effect," where model performance declines sharply beyond a critical load threshold, and show strong alignment between the framework's predictions and observed outcomes. This establishes a new paradigm for understanding and optimizing tool-using agents through the lens of cognitive load.

📝 Abstract
The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
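The abstract formalizes Intrinsic Load with a Tool Interaction Graph but does not spell out the metric here. As a rough illustration of the idea, the sketch below scores a toy tool-dependency graph by node count, edge count, and longest dependency chain; this proxy, the function name, and the example tools are assumptions for illustration, not the paper's actual formulation.

```python
# Illustrative sketch only: a toy proxy for "intrinsic load" over a tool
# interaction graph. The specific metric (tools, edges, chain depth) is an
# assumption, not the paper's formalization.
from collections import defaultdict

def intrinsic_load_proxy(edges):
    """Score a tool interaction graph given dependency edges (a, b),
    meaning tool b consumes tool a's output. Returns simple load counts."""
    graph = defaultdict(list)
    nodes = set()
    for a, b in edges:
        graph[a].append(b)
        nodes.update((a, b))

    def depth(node, seen=()):
        # Length of the longest dependency chain starting at `node`;
        # `seen` guards against cycles in malformed graphs.
        if node in seen:
            return 0
        children = graph.get(node, [])
        if not children:
            return 1
        return 1 + max(depth(c, seen + (node,)) for c in children)

    longest_chain = max((depth(n) for n in nodes), default=0)
    return {"tools": len(nodes), "edges": len(edges), "chain_depth": longest_chain}

# Hypothetical task: search -> filter -> summarize, plus search -> rank
print(intrinsic_load_proxy([("search", "filter"),
                            ("filter", "summarize"),
                            ("search", "rank")]))
# -> {'tools': 4, 'edges': 3, 'chain_depth': 3}
```

Under this reading, the benchmark's "parametric control" would correspond to dialing such structural quantities up or down while holding the task's surface description fixed.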
Problem

Research questions and friction points this paper is trying to address.

Cognitive Load
Tool-use Agents
Capability Boundaries
Evaluation Framework
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cognitive Load Theory
Tool Interaction Graph
Intrinsic Load
Extraneous Load
ToolLoad-Bench
Qihao Wang
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Yue Hu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Mingzhe Lu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Jiayue Wu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Yanbing Liu
Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences
Yuanmin Tang
University of Chinese Academy of Sciences
Machine learning