🤖 AI Summary
Existing evaluations of tool-using agents predominantly rely on final accuracy, which fails to uncover their cognitive bottlenecks and capability boundaries. This work introduces cognitive load theory into the assessment of agent tool use, decomposing task complexity into intrinsic load—modeled via tool interaction graphs—and extraneous load arising from task description ambiguity. The authors propose ToolLoad-Bench, a novel benchmark that enables parametric control over cognitive load, allowing diagnostic profiling of agent performance under varying load conditions. Empirical results reveal a “cliff effect,” where model performance declines sharply beyond a critical load threshold, and demonstrate strong alignment between predicted and observed outcomes. This framework establishes a new paradigm for understanding and optimizing tool-using agents through the lens of cognitive load.
📝 Abstract
The ability of Large Language Models (LLMs) to use external tools unlocks powerful real-world interactions, making rigorous evaluation essential. However, current benchmarks primarily report final accuracy, revealing what models can do but obscuring the cognitive bottlenecks that define their true capability boundaries. To move from simple performance scoring to a diagnostic tool, we introduce a framework grounded in Cognitive Load Theory. Our framework deconstructs task complexity into two quantifiable components: Intrinsic Load, the inherent structural complexity of the solution path, formalized with a novel Tool Interaction Graph; and Extraneous Load, the difficulty arising from ambiguous task presentation. To enable controlled experiments, we construct ToolLoad-Bench, the first benchmark with parametrically adjustable cognitive load. Our evaluation reveals distinct performance cliffs as cognitive load increases, allowing us to precisely map each model's capability boundary. We validate that our framework's predictions are highly calibrated with empirical results, establishing a principled methodology for understanding an agent's limits and a practical foundation for building more efficient systems.
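To make the Intrinsic Load idea concrete, here is a minimal illustrative sketch of scoring a solution path by the structure of its tool interaction graph (nodes as tool calls, edges as data dependencies). The feature set and weights below are hypothetical assumptions for exposition only, not the paper's actual Tool Interaction Graph formulation.

```python
def intrinsic_load(edges: list[tuple[str, str]]) -> float:
    """Toy intrinsic-load score: a weighted sum of tool count, dependency
    count, and the longest dependency chain (graph depth).

    The weights (1.0, 0.5, 2.0) are illustrative, not from the paper.
    """
    nodes = {n for edge in edges for n in edge}
    children: dict[str, list[str]] = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1

    def depth(n: str) -> int:
        # Length of the longest call chain starting at n (assumes a DAG).
        return 1 + max((depth(c) for c in children[n]), default=0)

    roots = [n for n in nodes if indegree[n] == 0]
    max_depth = max((depth(r) for r in roots), default=0)
    return 1.0 * len(nodes) + 0.5 * len(edges) + 2.0 * max_depth

# A three-step sequential chain: search -> parse -> summarize.
chain = [("search", "parse"), ("parse", "summarize")]
print(intrinsic_load(chain))  # 3 nodes + 0.5*2 edges + 2*depth 3 = 10.0
```

Under a sketch like this, making the graph deeper or more branched raises the load score parametrically, which is the kind of controlled adjustment ToolLoad-Bench is described as supporting.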