🤖 AI Summary
This study systematically evaluates the applicability of large language models (LLMs) to zero-shot automatic personality prediction from text (APPT) under the Big Five personality framework. It benchmarks GPT-4 against several lightweight open-source LLMs on three heterogeneous datasets (Essays, MyPersonality, and Pandora), using both minimal prompting and enriched prompting informed by linguistics and psychology. Results show that Openness and Agreeableness are comparatively easy to predict, whereas Extraversion and Neuroticism remain difficult; all models exhibit class-imbalance bias and prediction instability, and macro-level metrics obscure fine-grained performance disparities, which makes per-class recall analysis necessary. The core contribution is the identification of a synergistic bias mechanism between prompt design and personality-trait framing, with empirical evidence that current LLMs lack stable, reliable zero-shot personality inference capability.
📝 Abstract
We evaluate large language models (LLMs) for automatic personality prediction from text (APPT) under the binary Five Factor Model (BIG5). Five models, including GPT-4 and lightweight open-source alternatives, are tested across three heterogeneous datasets (Essays, MyPersonality, Pandora) and two prompting strategies (minimal vs. enriched with linguistic and psychological cues). Enriched prompts reduce invalid outputs and improve class balance, but they also introduce a systematic bias toward predicting trait presence. Performance varies substantially: Openness and Agreeableness are relatively easier to detect, while Extraversion and Neuroticism remain challenging. Although open-source models sometimes approach GPT-4 and prior benchmarks, no configuration yields consistently reliable predictions in the zero-shot binary setting. Moreover, aggregate metrics such as accuracy and macro-F1 mask significant per-class asymmetries, and per-class recall offers clearer diagnostic value. These findings show that current out-of-the-box LLMs are not yet suitable for APPT, and that careful coordination of prompt design, trait framing, and evaluation metrics is essential for interpretable results.
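The abstract's point about aggregate metrics can be made concrete with a toy calculation. The numbers below are illustrative, not taken from the paper's experiments: a presence-biased classifier on an imbalanced trait achieves respectable accuracy while its recall on the "trait absent" class collapses, which is exactly the asymmetry that per-class recall exposes and macro-F1 partially hides.

```python
def per_class_recall(y_true, y_pred, classes=(0, 1)):
    """Recall per class: correctly predicted members / true members."""
    recalls = {}
    for c in classes:
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / support if support else 0.0
    return recalls

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical data: 1 = trait present, 0 = absent.
# Ground truth is imbalanced (70/30); the model leans toward "present".
y_true = [1] * 70 + [0] * 30
y_pred = [1] * 68 + [0] * 2 + [1] * 27 + [0] * 3

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", acc)                              # 0.71 -- looks passable
print("macro-F1:", round(macro_f1(y_true, y_pred), 3))
print("recall:  ", per_class_recall(y_true, y_pred))  # class 0 recall is only 0.1
```

Here accuracy (0.71) and macro-F1 (≈0.50) suggest a mediocre-but-usable model, while the per-class view reveals that "trait absent" texts are almost never recovered (recall 0.10 vs. 0.97 for "present"). This is the diagnostic gap the abstract attributes to aggregate metrics.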