🤖 AI Summary
Existing long-context reasoning benchmarks fail to orthogonally disentangle intrinsic task complexity, distractor density, and sequence length—hindering precise failure attribution. To address this, we introduce the first synthetic natural language inference benchmark grounded in Cognitive Load Theory, enabling independent control of intrinsic load (logical depth $d$), extraneous load (distractor ratio $\rho$), and germane load (task length $N$). Leveraging a parametric logic-puzzle generation framework, we conduct systematic evaluation across 22 state-of-the-art LLMs. Results reveal: (1) task length $N$ is the dominant performance bottleneck; (2) models exhibit markedly heterogeneous sensitivity to intrinsic complexity $d$; and (3) accuracy follows a U-shaped curve with respect to distractor density $\rho$, indicating non-monotonic robustness. This benchmark provides a reproducible, scalable, and multidimensionally controllable diagnostic tool for fine-grained reasoning capability analysis.
📝 Abstract
Current benchmarks for long-context reasoning in Large Language Models (LLMs) often blur critical factors like intrinsic task complexity, distractor interference, and task length. To enable more precise failure analysis, we introduce CogniLoad, a novel synthetic benchmark grounded in Cognitive Load Theory (CLT). CogniLoad generates natural-language logic puzzles with independently tunable parameters that reflect CLT's core dimensions: intrinsic difficulty ($d$) controls intrinsic load; distractor-to-signal ratio ($\rho$) regulates extraneous load; and task length ($N$) serves as an operational proxy for conditions demanding germane load. Evaluating 22 state-of-the-art reasoning LLMs, CogniLoad reveals distinct performance sensitivities, identifying task length as a dominant constraint and uncovering varied tolerances to intrinsic complexity and U-shaped responses to distractor ratios. By offering systematic, factorial control over these cognitive load dimensions, CogniLoad provides a reproducible, scalable, and diagnostically rich tool for dissecting LLM reasoning limitations and guiding future model development.
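The factorial control described above can be illustrated with a minimal sketch. The `PuzzleSpec` class and `factorial_grid` helper below are hypothetical names, not CogniLoad's actual API; the sketch only shows how a full crossing of $(d, \rho, N)$ lets each load dimension be varied while the others are held fixed:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class PuzzleSpec:
    """Hypothetical configuration for one CogniLoad-style puzzle."""
    d: int      # intrinsic difficulty: logical reasoning depth
    rho: float  # distractor-to-signal ratio (extraneous load)
    n: int      # task length in statements (germane-load proxy)

def factorial_grid(depths, ratios, lengths):
    """Full factorial crossing of the three load dimensions.

    Because every combination is generated, accuracy can later be
    averaged along one axis while conditioning on the other two,
    attributing failures to a single factor.
    """
    return [PuzzleSpec(d, rho, n) for d, rho, n in product(depths, ratios, lengths)]

# Example sweep: 3 levels per factor -> 3 * 3 * 3 = 27 experimental cells.
grid = factorial_grid(depths=[2, 4, 8], ratios=[0.0, 0.5, 1.0], lengths=[10, 50, 100])
print(len(grid))
```

Holding two parameters fixed and sweeping the third yields the per-dimension sensitivity curves (e.g. the U-shape in $\rho$) reported in the abstract.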