A Framework for Robust Cognitive Evaluation of LLMs

📅 2025-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
The evaluation of large language models' (LLMs) cognitive capabilities lacks standardized, methodologically rigorous frameworks. Method: This paper introduces CognitivEval, a systematic framework for assessing the artificial cognitive abilities of LLMs, with a core emphasis on robustness in response collection. It combines a joint evaluation paradigm that gathers both generated outputs and model probability estimates with automated prompt permutations to improve experimental stability and reproducibility. Contribution/Results: CognitivEval replicates five classic cognitive science experiments, builds cognitive profiles of several state-of-the-art LLMs, and measurably improves evaluation robustness. The framework will be open-sourced to facilitate interdisciplinary research at the intersection of cognitive science and AI.

📝 Abstract
Emergent cognitive abilities in large language models (LLMs) have been widely observed, but their nature and underlying mechanisms remain poorly understood. A growing body of research draws on cognitive science to investigate LLM cognition, but standard methodologies and experimental pipelines have not yet been established. To address this gap, we develop CognitivEval, a framework for systematically evaluating the artificial cognitive capabilities of LLMs, with a particular emphasis on robustness in response collection. The key features of CognitivEval include: (i) automatic prompt permutations, and (ii) testing that gathers both generations and model probability estimates. Our experiments demonstrate that these features lead to more robust experimental outcomes. Using CognitivEval, we replicate five classic experiments in cognitive science, illustrating the framework's generalizability across various experimental tasks and obtaining a cognitive profile of several state-of-the-art LLMs. CognitivEval will be released publicly to foster broader collaboration within the cognitive science community.
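The two robustness features named in the abstract are easy to picture in code. Below is a minimal sketch, not the authors' released implementation, of (i) automatic prompt permutations and (ii) joint collection of a free-form generation and per-option probability estimates. It assumes a Hugging Face causal LM ("gpt2" as a stand-in); the prompt template, answer options, and scoring details are illustrative assumptions.

```python
# Minimal sketch (not the paper's released code) of prompt permutations plus
# joint collection of generations and option log-probabilities.
from itertools import permutations

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()


def prompt_permutations(instruction: str, options: list[str]) -> list[str]:
    """Enumerate prompts that differ only in the order the options are listed."""
    return [
        instruction + " Options: " + ", ".join(order) + ". Answer:"
        for order in permutations(options)
    ]


@torch.no_grad()
def generate_and_score(prompt: str, options: list[str]) -> dict:
    """Collect a free-form generation and a log-probability for each option."""
    inputs = tokenizer(prompt, return_tensors="pt")
    gen_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    generation = tokenizer.decode(
        gen_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    option_scores = {}
    for option in options:
        full = tokenizer(prompt + " " + option, return_tensors="pt")
        logits = model(**full).logits
        # Score only the option tokens, conditioned on the prompt prefix
        # (small tokenization-boundary effects are ignored in this sketch).
        start = inputs.input_ids.shape[1]
        target = full.input_ids[0, start:]
        log_probs = torch.log_softmax(logits[0, start - 1:-1], dim=-1)
        option_scores[option] = log_probs.gather(1, target[:, None]).sum().item()

    return {"generation": generation.strip(), "option_log_probs": option_scores}


options = ["yes", "no"]
for prompt in prompt_permutations("Is a robin a bird?", options):
    print(generate_and_score(prompt, options))
```

Running each permutation in both modes makes it possible to check whether the generated answer and the highest-probability option agree, and whether that agreement is stable under prompt reordering.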
Problem

Research questions and friction points this paper is trying to address.

Understanding emergent cognitive abilities in large language models
Lack of standard methodologies for evaluating LLM cognition
Need for robust frameworks to assess artificial cognitive capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatic prompt permutations for robust evaluation
Joint collection of generations and model probability estimates
Cognitive profile creation for state-of-the-art LLMs (a rough aggregation sketch follows this list)
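As a rough illustration of how permutation-level results could be rolled up into a task score and then a cognitive profile, the hypothetical aggregation below assumes each result dictionary carries the option log-probabilities collected in the earlier sketch; the accuracy-plus-spread summary is an assumption, not the paper's published procedure.

```python
# Hypothetical aggregation of per-permutation results into a cognitive profile.
from statistics import mean, stdev


def task_score(results: list[dict], answer_key: str) -> dict:
    """Aggregate one task's results over all prompt permutations.

    The score is the fraction of permutations whose top-probability option
    matches the answer key; the spread indicates robustness to prompt order.
    """
    hits = [
        1.0 if max(r["option_log_probs"], key=r["option_log_probs"].get) == answer_key
        else 0.0
        for r in results
    ]
    return {"accuracy": mean(hits), "spread": stdev(hits) if len(hits) > 1 else 0.0}


def cognitive_profile(per_task_results: dict[str, tuple[list[dict], str]]) -> dict:
    """Map each cognitive task to its aggregated score for one model."""
    return {task: task_score(res, key) for task, (res, key) in per_task_results.items()}
```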
Karin de Langis
PhD Candidate, University of Minnesota
Artificial Intelligence, Robotics, Computer Vision
Jong Inn Park
University of Minnesota
Natural Language Processing
Bin Hu
Department of Computer Science and Engineering, University of Minnesota
Khanh Chi Le
Department of Computer Science and Engineering, University of Minnesota
Andreas Schramm
Department of Linguistics, Hamline University
M. Mensink
Department of Psychology, University of Wisconsin-Stout
Andrew Elfenbein
Department of English, University of Minnesota
Dongyeop Kang
University of Minnesota
Natural Language Processing