🤖 AI Summary
Existing LLM personality assessments rely on context-free, isolated questionnaires (e.g., the “Disney World Test”), failing to capture behavioral consistency in realistic, multi-turn dialogues. Method: We propose CAPE—the first context-aware personality evaluation framework—systematically integrating dialogue history into psychometric assessment and introducing a response consistency metric to quantify contextual effects on personality stability and drift. Contribution/Results: Evaluating seven mainstream LLMs, we find that GPT-series models exhibit pronounced personality drift under historical influence; Gemini and Llama show heightened sensitivity to question ordering; and role-playing scenarios yield outputs better aligned with human judgments. CAPE establishes a novel paradigm for trustworthy LLM personality modeling and provides a reproducible, context-sensitive benchmark for rigorous evaluation.
📝 Abstract
Psychometric tests, traditionally used to assess humans, are now being applied to Large Language Models (LLMs) to evaluate their behavioral traits. However, existing studies follow a context-free approach, answering each question in isolation to avoid contextual influence. We term this the Disney World test: an artificial setting that ignores real-world applications, where conversational history shapes responses. To bridge this gap, we propose the first Context-Aware Personality Evaluation (CAPE) framework for LLMs, incorporating prior conversational interactions. To thoroughly analyze the influence of context, we introduce novel metrics to quantify the consistency of LLM responses, a fundamental trait of human behavior.
Our exhaustive experiments on 7 LLMs reveal that conversational history enhances response consistency via in-context learning but also induces personality shifts, with GPT-3.5-Turbo and GPT-4-Turbo exhibiting extreme deviations. While GPT models are robust to question ordering, Gemini-1.5-Flash and Llama-8B display significant sensitivity. Moreover, the responses of GPT models stem from both their intrinsic personality traits and prior interactions, whereas Gemini-1.5-Flash and Llama-8B depend heavily on prior interactions. Finally, applying our framework to Role Playing Agents (RPAs) shows that context-dependent personality shifts improve response consistency and better align with human judgments. Our code and datasets are publicly available at: https://github.com/jivnesh/CAPE