🤖 AI Summary
Existing evaluations of tool-using agents predominantly rely on final accuracy, which fails to uncover their cognitive bottlenecks and capability boundaries. This work introduces cognitive load theory into the assessment of agent tool use, decomposing task complexity into intrinsic load—modeled via tool interaction graphs—and extraneous load arising from task description ambiguity. The authors propose ToolLoad-Bench, a novel benchmark that enables parametric control over cognitive load, allowing diagnostic profiling of agent performance under varying load conditions. Empirical results reveal a “cliff effect,” where model performance declines sharply beyond a critical load threshold, and demonstrate strong alignment between predicted and observed outcomes. This framework establishes a new paradigm for understanding and optimizing tool-using agents through the lens of cognitive load.
📝 Abstract
The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
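To make the Intrinsic Load idea concrete, here is a minimal illustrative sketch of scoring a solution path by the structure of its tool interaction graph (nodes as tool calls, edges as data dependencies). The feature set and weights below are hypothetical assumptions for exposition only, not the paper's actual Tool Interaction Graph formulation.

```python
def intrinsic_load(edges: list[tuple[str, str]]) -> float:
    """Toy intrinsic-load score: a weighted sum of tool count, dependency
    count, and the longest dependency chain (graph depth).

    The weights (1.0, 0.5, 2.0) are illustrative, not from the paper.
    """
    nodes = {n for edge in edges for n in edge}
    children: dict[str, list[str]] = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    def depth(n: str) -> int:
        # Length of the longest call chain starting at n (assumes a DAG).
        return 1 + max((depth(c) for c in children[n]), default=0)

    roots = [n for n in nodes if indegree[n] == 0]
    max_depth = max((depth(r) for r in roots), default=0)
    return 1.0 * len(nodes) + 0.5 * len(edges) + 2.0 * max_depth

# A three-step sequential chain: search -> parse -> summarize.
chain = [("search", "parse"), ("parse", "summarize")]
print(intrinsic_load(chain))  # 3 nodes + 0.5*2 edges + 2*depth 3 = 10.0
```

Under a sketch like this, making the graph deeper or more branched raises the load score parametrically, which is the kind of controlled adjustment ToolLoad-Bench is described as supporting.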