🤖 AI Summary
Existing evaluation paradigms for large language models (LLMs) lack sensitivity to human-like cognitive capacities, limiting their ability to discriminate fine-grained differences in model quality.
Method: We systematically compare three evaluation paradigms: large-scale QA benchmarks (MMLU, BBH), game-theoretic interactive tasks (Signalling Games, Taboo), and cognitively grounded tasks derived from human cognitive theories (e.g., working memory, theory of mind). We introduce the first application of cognitive assessment frameworks to LLM evaluation, integrating correlational and causal analysis.
Contribution/Results: Interactive games significantly outperform static benchmarks in discriminating executive functions and socio-emotional abilities. While causal and logical reasoning exhibit high cross-paradigm consistency, core executive functions and social cognition emerge robustly only within the interactive paradigm. This work establishes a novel, cognitively aligned evaluation framework that better captures human-relevant dimensions of LLM competence.
📝 Abstract
We examine three evaluation paradigms: large question-answering benchmarks (e.g., MMLU and BBH), interactive games (e.g., Signalling Games or Taboo), and cognitive tests (e.g., for working memory or theory of mind). First, we investigate which of the former two (benchmarks or games) is most effective at discriminating LLMs of varying quality. Then, inspired by human cognitive assessments, we compile a suite of targeted tests that measure cognitive abilities deemed essential for effective language use, and we investigate their correlation with model performance in benchmarks and games. Our analyses reveal that interactive games are superior to standard benchmarks in discriminating models. Causal and logical reasoning correlate with both static and interactive tests, while differences emerge regarding core executive functions and social/emotional skills, which correlate more with games. We advocate the development of new interactive benchmarks and targeted cognitive tasks inspired by assessing human abilities but designed specifically for LLMs.
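The correlational part of the analysis can be illustrated with a minimal sketch: rank-correlating per-model scores from two paradigms (e.g., a static benchmark vs. an interactive game). The scores below are invented placeholders, not the paper's data, and the hand-rolled Spearman implementation is just an assumption about the kind of statistic used.

```python
# Hypothetical sketch of a cross-paradigm correlational analysis.
# All scores are made-up placeholders, not results from the paper.

def ranks(xs):
    """Assign 1-based ranks to a list of scores (averaging ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a block of tied values.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Placeholder per-model scores on two paradigms (same model order).
benchmark_scores = [0.61, 0.72, 0.55, 0.80, 0.68]
game_scores      = [0.40, 0.66, 0.35, 0.58, 0.74]
print(round(spearman(benchmark_scores, game_scores), 3))  # → 0.6
```

A high coefficient (here 0.6 on toy data) would indicate that the two paradigms rank models similarly, as the paper reports for causal and logical reasoning; a low one would flag abilities (e.g., executive functions, social cognition) that only one paradigm discriminates.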