🤖 AI Summary
Existing long-context reasoning benchmarks fail to orthogonally disentangle intrinsic task complexity, distractor density, and sequence length—hindering precise failure attribution. To address this, we introduce the first synthetic natural language inference benchmark grounded in Cognitive Load Theory, enabling independent control of intrinsic load (logical depth $d$), extraneous load (distractor ratio $\rho$), and germane load (task length $N$). Leveraging a parametric logic-puzzle generation framework, we conduct systematic evaluation across 22 state-of-the-art LLMs. Results reveal: (1) task length $N$ is the dominant performance bottleneck; (2) models exhibit markedly heterogeneous sensitivity to intrinsic complexity $d$; and (3) accuracy follows a U-shaped curve with respect to distractor density $\rho$, indicating non-monotonic robustness. This benchmark provides a reproducible, scalable, and multidimensionally controllable diagnostic tool for fine-grained reasoning capability analysis.
📝 Abstract
Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 state-of-the-art reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
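The factorial control described above can be illustrated with a minimal sketch. The `PuzzleSpec` class and `factorial_grid` helper below are hypothetical names, not CogniLoad's actual API; the sketch only shows how a full crossing of $(d, \rho, N)$ lets each load dimension be varied while the others are held fixed:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class PuzzleSpec:
    """Hypothetical configuration for one CogniLoad-style puzzle."""
    d: int      # intrinsic difficulty: logical reasoning depth
    rho: float  # distractor-to-signal ratio (extraneous load)
    n: int      # task length in statements (germane-load proxy)

def factorial_grid(depths, ratios, lengths):
    """Full factorial crossing of the three load dimensions.

    Because every combination is generated, accuracy can later be
    averaged along one axis while conditioning on the other two,
    attributing failures to a single factor.
    """
    return [PuzzleSpec(d, rho, n) for d, rho, n in product(depths, ratios, lengths)]

# Example sweep: 3 levels per factor -> 3 * 3 * 3 = 27 experimental cells.
grid = factorial_grid(depths=[2, 4, 8], ratios=[0.0, 0.5, 1.0], lengths=[10, 50, 100])
print(len(grid))
```

Holding two parameters fixed and sweeping the third yields the per-dimension sensitivity curves (e.g. the U-shape in $\rho$) reported in the abstract.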