🤖 AI Summary
This study reveals an intrinsic instability in large language models (LLMs) for personality assessment: minor prompt perturbations, such as item reordering, induce measurement shifts of up to 20%, with standard deviations exceeding 0.4 even in models with >400B parameters. Method: To systematically evaluate this phenomenon, the authors introduce PERSIST, a framework integrating classical psychometric instruments (BFI-44, SD3) and novel LLM-adapted measures, applied across 25+ open-source models and over 500,000 responses. Contribution/Results: Contrary to prevailing assumptions, widely adopted stabilization techniques, including chain-of-thought prompting, role assignment, and dialogue history, are not only ineffective but can exacerbate response variability. This indicates that the instability is fundamentally architectural, not methodological. The findings critically challenge the foundational premise that "personality alignment" ensures safe, reliable LLM deployment, demonstrating that current LLMs lack the internal consistency required for stable behavioral inference.
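To make the reordering-shift metric concrete, here is a minimal sketch, not PERSIST's actual code: `query_model` is a hypothetical stub for an LLM inference call, and expressing the shift as a percentage of the 1-5 Likert range is an assumed convention for the "20%" figure.

```python
import random
import statistics

LIKERT_MIN, LIKERT_MAX = 1, 5  # assumed 5-point scale, as in the BFI-44

def query_model(items: list[str]) -> list[int]:
    """Hypothetical stub: administer the items in the given order to an
    LLM and parse one Likert rating (1-5) per item."""
    raise NotImplementedError

def trait_score(items: list[str]) -> float:
    """Mean Likert rating over a trait's items, in presentation order."""
    return statistics.mean(query_model(items))

def reordering_shift(items: list[str], seed: int = 0) -> float:
    """Percent shift in a trait score caused purely by item reordering,
    measured relative to the scale range (an assumed convention)."""
    baseline = trait_score(items)
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    return 100 * abs(trait_score(shuffled) - baseline) / (LIKERT_MAX - LIKERT_MIN)
```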
📄 Abstract
Large language models require consistent behavioral patterns for safe deployment, yet their personality-like traits remain poorly understood. We present PERSIST (PERsonality Stability in Synthetic Text), a comprehensive evaluation framework testing 25+ open-source models (1B-671B parameters) across 500,000+ responses. Using traditional (BFI-44, SD3) and novel LLM-adapted personality instruments, we systematically vary question order, paraphrasing, personas, and reasoning modes. Our findings challenge fundamental deployment assumptions: (1) even 400B+ models exhibit substantial response variability (SD > 0.4); (2) minor prompt reordering alone shifts personality measurements by up to 20%; (3) interventions expected to stabilize behavior, such as chain-of-thought reasoning, detailed persona instructions, and inclusion of conversation history, can paradoxically increase variability; (4) LLM-adapted instruments exhibit instability equal to that of their human-centric counterparts, confirming architectural rather than translational limitations. This persistent instability across scales and mitigation strategies suggests current LLMs lack the foundations for genuine behavioral consistency. For safety-critical applications requiring predictable behavior, these findings indicate that personality-based alignment strategies may be fundamentally inadequate.
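As a rough illustration of the per-condition stability check the abstract describes, the sketch below compares score variability under item reordering against a fixed-persona condition. Every name here (`query_model`, `PERTURBATIONS`, the example persona) is a hypothetical stand-in rather than the PERSIST API, and the stub is assumed to sample at temperature > 0 so that repeated runs can differ; a real harness would also add paraphrased item variants and reasoning-mode conditions.

```python
import random
import statistics

def query_model(items: list[str], persona: str | None = None) -> list[int]:
    """Hypothetical stub: administer the items (optionally under a persona
    system prompt) and parse one 1-5 Likert rating per item. Assumed to
    sample at temperature > 0, so repeated calls may differ."""
    raise NotImplementedError

def shuffled(items: list[str], rng: random.Random):
    """Reordering perturbation: same items, random presentation order."""
    order = items[:]
    rng.shuffle(order)
    return order, None

def with_persona(items: list[str], rng: random.Random):
    """Persona perturbation: fixed order, illustrative persona prompt."""
    return items[:], "You are a calm, conscientious assistant."

PERTURBATIONS = [shuffled, with_persona]

def sd_per_condition(items: list[str], n_runs: int = 50, seed: int = 0) -> dict[str, float]:
    """Standard deviation of the mean trait score within each condition,
    on the raw 1-5 scale; per the paper's reading, SD > 0.4 signals
    substantial instability, and a persona condition that raises SD
    would illustrate the paradoxical effect in finding (3)."""
    rng = random.Random(seed)
    results = {}
    for perturb in PERTURBATIONS:
        scores = []
        for _ in range(n_runs):
            order, persona = perturb(items, rng)
            scores.append(statistics.mean(query_model(order, persona)))
        results[perturb.__name__] = statistics.stdev(scores)
    return results
```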