Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models

📅 2024-06-25
🏛️ arXiv.org
📈 Citations: 26
Influential: 4
🤖 AI Summary
This study investigates whether large language models (LLMs) possess stable, quantifiable psychological attributes analogous to human traits. Method: We introduce the first psychometric benchmark for LLMs, covering six core dimensions (personality, values, emotion, theory of mind, motivation, and intelligence) and integrating 13 diverse datasets. We adapt classical psychometric paradigms, including item response theory and behavioral consistency analysis, to LLM evaluation via multi-turn interactive prompting and rigorous validity assessment. Contribution/Results: We demonstrate that mainstream LLMs exhibit robust psychological profiles that remain quantifiable across models and prompts (intra-class correlation > 0.7). Critically, we identify, for the first time, a systematic dissociation between self-reported and behaviorally manifested attributes. Furthermore, we establish the first reproducible, multidimensional, and cross-scenario framework for quantifying LLM psychological attributes, enabling rigorous, theory-grounded evaluation of model cognition and behavior.
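The stability claim above rests on intra-class correlation (ICC), a standard reliability coefficient from psychometrics. As a rough illustration of what "ICC > 0.7" measures, here is a minimal sketch of a generic one-way random-effects ICC(1,1), computed via one-way ANOVA mean squares; this is a textbook formulation, not necessarily the exact ICC variant or data layout the authors use, and the toy scores are invented for illustration:

```python
import numpy as np

def icc_1(scores: np.ndarray) -> float:
    """One-way random-effects ICC(1,1).

    scores: shape (n_targets, k_measurements), e.g. rows are models (or items)
    and columns are repeated measurements under different prompt phrasings.
    Returns a value near 1 when scores are stable across measurements.
    """
    n, k = scores.shape
    grand_mean = scores.mean()
    row_means = scores.mean(axis=1)
    # Between-target and within-target mean squares from one-way ANOVA
    msb = k * ((row_means - grand_mean) ** 2).sum() / (n - 1)
    msw = ((scores - row_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Toy example: 4 hypothetical "models" scored under 3 prompt variants.
scores = np.array([
    [4.0, 4.2, 3.9],
    [2.1, 2.3, 2.0],
    [3.5, 3.4, 3.6],
    [1.2, 1.0, 1.1],
])
print(icc_1(scores) > 0.7)  # → True: scores are stable across prompts
```

An ICC above 0.7 conventionally indicates that most of the observed variance comes from genuine differences between targets rather than from measurement noise, which is the sense in which the paper calls the profiles stable.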

📝 Abstract
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. The broader integration of LLMs into society has sparked interest in whether they manifest psychological attributes, and whether these attributes are stable, inquiries that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a framework for investigating psychology in LLMs, including psychological dimension identification, assessment dataset curation, and assessment with results validation. Following this framework, we introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence. This benchmark includes thirteen datasets featuring diverse scenarios and item types. Our findings indicate that LLMs manifest a broad spectrum of psychological attributes. We also uncover discrepancies between LLMs' self-reported traits and their behaviors in real-world scenarios. This paper demonstrates a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
Problem

Research questions and friction points this paper is trying to address.

Quantifying psychological constructs in Large Language Models through comprehensive benchmarking
Identifying discrepancies between self-reported traits and real-world response patterns
Assessing reliability of human-designed preference tests when applied to LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive benchmark quantifying psychological constructs of LLMs
Assessed six core psychological dimensions through thirteen diverse datasets
Identified discrepancies between self-reported traits and real-world responses