🤖 AI Summary
This work proposes NeuroCognition, a novel benchmark that adapts three classic neuropsychological paradigms (Raven's Progressive Matrices, spatial working memory tasks, and the Wisconsin Card Sorting Test) to large language model (LLM) evaluation, systematically assessing three core cognitive capacities: abstract reasoning, working memory, and cognitive flexibility. While existing LLM evaluations predominantly measure task-specific performance, NeuroCognition targets the foundational cognitive abilities underpinning human intelligence. Multimodal experiments across 156 models show that NeuroCognition correlates with general-capability benchmarks yet captures distinct cognitive traits; they also expose significant performance degradation on image-based tasks and in high-complexity scenarios. Notably, simple human-like strategies often outperform sophisticated reasoning approaches. This study establishes a verifiable pathway toward developing human-like, adaptive cognitive capabilities in LLMs.
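To make the cognitive-flexibility paradigm concrete, below is a minimal sketch of a WCST-style episode: the sorter gets only right/wrong feedback, and the hidden rule silently switches after a streak of correct sorts. The pile construction, the 3-trial switch criterion, and the random stand-in agent are illustrative assumptions, not the benchmark's actual protocol.

```python
import random

RULES = ("color", "shape", "number")
COLORS = ("red", "green", "blue", "yellow")
SHAPES = ("circle", "square", "triangle", "cross")
NUMBERS = (1, 2, 3, 4)

# Four key piles, each distinct in every attribute, so a card matches
# exactly one pile under any given rule.
PILES = [{"color": c, "shape": s, "number": n}
         for c, s, n in zip(COLORS, SHAPES, NUMBERS)]

def correct_pile(card: dict, rule: str) -> int:
    """Index of the pile that matches `card` on the hidden rule."""
    return next(i for i, p in enumerate(PILES) if p[rule] == card[rule])

rng = random.Random(0)
rule, streak = rng.choice(RULES), 0
for trial in range(20):
    card = {"color": rng.choice(COLORS), "shape": rng.choice(SHAPES),
            "number": rng.choice(NUMBERS)}
    guess = rng.randrange(4)            # stand-in for a model's sort decision
    ok = guess == correct_pile(card, rule)
    print(f"trial {trial:2d}: rule={rule:6s} guess={guess} {'OK' if ok else 'X'}")
    streak = streak + 1 if ok else 0
    if streak == 3:                     # switch criterion: assumed, not the paper's
        rule, streak = rng.choice(RULES), 0
```

A flexible agent must infer the rule switch from feedback alone and re-sort accordingly; persisting with the old rule is the classic perseverative error the paradigm is designed to detect.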
📝 Abstract
Large language models (LLMs) exhibit a unified "general factor" of capability across 10 benchmarks, a finding confirmed by our factor analysis of 156 models, yet they still struggle with tasks that are trivial for humans. We argue this is because current benchmarks focus on task completion and fail to probe the foundational cognitive abilities that underlie these behaviors. We address this by introducing the NeuroCognition benchmark, grounded in three adapted neuropsychological tests: Raven's Progressive Matrices (abstract relational reasoning), Spatial Working Memory (maintenance and systematic search), and the Wisconsin Card Sorting Test (cognitive flexibility). Our evaluation reveals that while models perform strongly on text, their performance degrades on image inputs and as task complexity increases. Furthermore, we observe that complex reasoning is not universally beneficial, whereas simple, human-like strategies yield partial gains. We also find that NeuroCognition correlates positively with standard general-capability benchmarks while still measuring distinct cognitive abilities beyond them. Overall, NeuroCognition highlights where current LLMs align with human-like intelligence and where they lack core adaptive cognition, showing its potential to serve as a verifiable, scalable resource for improving LLMs.
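The factor analysis itself is not shown on this page; the sketch below illustrates the general idea on synthetic data. Everything beyond the 156-model / 10-benchmark shape (the loadings, noise levels, and the hypothetical NeuroCognition score) is an assumption for illustration, not the paper's data or procedure.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's data: 156 models scored on 10
# general-capability benchmarks, with one latent factor g driving
# most of the shared variance (loadings and noise level are made up).
n_models, n_benchmarks = 156, 10
g = rng.normal(size=(n_models, 1))                       # latent "general ability"
loadings = rng.uniform(0.6, 0.9, size=(1, n_benchmarks))
scores = g @ loadings + 0.3 * rng.normal(size=(n_models, n_benchmarks))

# Fit a one-factor model: if a single factor captures the shared
# variance, every benchmark should load strongly on it.
fa = FactorAnalysis(n_components=1, random_state=0)
g_hat = fa.fit_transform(scores).ravel()                 # per-model factor score
print("benchmark loadings:", np.round(fa.components_.ravel(), 2))

# A hypothetical NeuroCognition score: positively related to g but with
# substantial independent variance, mirroring "correlated yet distinct".
# The fitted factor's sign is arbitrary, so compare |r|.
neuro = 0.6 * g.ravel() + 0.8 * rng.normal(size=n_models)
r = np.corrcoef(g_hat, neuro)[0, 1]
print(f"|corr(general factor, NeuroCognition)| = {abs(r):.2f}")
```

Under this setup, uniformly high loadings indicate a single shared factor, while a positive but clearly imperfect correlation is the signature of a benchmark that overlaps with general capability yet measures something beyond it.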