🤖 AI Summary
This study reveals an intrinsic instability in large language models (LLMs) for personality assessment: minor prompt perturbations, such as item reordering, induce measurement shifts of up to 20%, with standard deviations exceeding 0.4 even in models with >400B parameters. Method: To systematically evaluate this phenomenon, the authors introduce PERSIST, a framework integrating classical psychometric instruments (BFI-44, SD3) and novel LLM-adapted measures, applied across 25+ open-source models and over 500,000 responses. Contribution/Results: Contrary to prevailing assumptions, widely adopted stabilization techniques, including chain-of-thought prompting, role assignment, and dialogue history, are not only ineffective but can exacerbate response variability. This indicates that the instability is fundamentally architectural, not methodological. The findings critically challenge the foundational premise that "personality alignment" ensures safe, reliable LLM deployment, demonstrating that current LLMs lack the internal consistency required for stable behavioral inference.
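To make the reordering-shift metric concrete, here is a minimal sketch, not PERSIST's actual code: `query_model` is a hypothetical stub for an LLM inference call, and expressing the shift as a percentage of the 1-5 Likert range is an assumed convention for the "20%" figure.

```python
import random
import statistics

LIKERT_MIN, LIKERT_MAX = 1, 5  # assumed 5-point scale, as in the BFI-44

def query_model(items: list[str]) -> list[int]:
    """Hypothetical stub: administer the items in the given order to an
    LLM and parse one Likert rating (1-5) per item."""
    raise NotImplementedError

def trait_score(items: list[str]) -> float:
    """Mean Likert rating over a trait's items, in presentation order."""
    return statistics.mean(query_model(items))

def reordering_shift(items: list[str], seed: int = 0) -> float:
    """Percent shift in a trait score caused purely by item reordering,
    measured relative to the scale range (an assumed convention)."""
    baseline = trait_score(items)
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    return 100 * abs(trait_score(shuffled) - baseline) / (LIKERT_MAX - LIKERT_MIN)
```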
📄 Abstract
Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) even 400B+ models exhibit substantial response variability (SD > 0.4); (2) minor prompt reordering alone shifts personality measurements by up to 20%; (3) interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed persona instructions, and inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments exhibit instability equal to that of their human-centric counterparts, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
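As a rough illustration of the per-condition stability check the abstract describes, the sketch below compares score variability under item reordering against a fixed-persona condition. Every name here (`query_model`, `PERTURBATIONS`, the example persona) is a hypothetical stand-in rather than the PERSIST API, and the stub is assumed to sample at temperature > 0 so that repeated runs can differ; a real harness would also add paraphrased item variants and reasoning-mode conditions.

```python
import random
import statistics

def query_model(items: list[str], persona: str | None = None) -> list[int]:
    """Hypothetical stub: administer the items (optionally under a persona
    system prompt) and parse one 1-5 Likert rating per item. Assumed to
    sample at temperature > 0, so repeated calls may differ."""
    raise NotImplementedError

def shuffled(items: list[str], rng: random.Random):
    """Reordering perturbation: same items, random presentation order."""
    order = items[:]
    rng.shuffle(order)
    return order, None

def with_persona(items: list[str], rng: random.Random):
    """Persona perturbation: fixed order, illustrative persona prompt."""
    return items[:], "You are a calm, conscientious assistant."

PERTURBATIONS = [shuffled, with_persona]

def sd_per_condition(items: list[str], n_runs: int = 50, seed: int = 0) -> dict[str, float]:
    """Standard deviation of the mean trait score within each condition,
    on the raw 1-5 scale; per the paper's reading, SD > 0.4 signals
    substantial instability, and a persona condition that raises SD
    would illustrate the paradoxical effect in finding (3)."""
    rng = random.Random(seed)
    results = {}
    for perturb in PERTURBATIONS:
        scores = []
        for _ in range(n_runs):
            order, persona = perturb(items, rng)
            scores.append(statistics.mean(query_model(order, persona)))
        results[perturb.__name__] = statistics.stdev(scores)
    return results
```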