🤖 AI Summary
This study investigates whether large language models (LLMs) can reliably reproduce human social decision-making under personality prompting—a core assumption in personality alignment research. Method: Drawing on canonical psychological paradigms—Milgram’s obedience experiment and the ultimatum game—we construct a cross-vendor (OpenAI, Anthropic, Google, Meta), multi-prompt social behavior evaluation framework. Contribution/Results: Empirical results demonstrate that all evaluated models fail to consistently instantiate target personality traits in social contexts, exhibiting systematic deviations in obedience and fairness tasks. Critically, this failure is robust across model architectures and prompt perturbations. To our knowledge, this is the first systematic demonstration of the pervasive ineffectiveness of personality prompting in social decision-making scenarios. These findings challenge prevailing optimistic assumptions in personality alignment literature and indicate that current approaches remain in an early, foundational stage.
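To make the evaluation setup concrete, here is a minimal sketch of what a personality-prompted ultimatum-game harness could look like. This is not the authors' code: the trait labels, prompt wording, `build_prompt`/`evaluate` helpers, and the stub responder are all hypothetical placeholders (a real study would query each vendor's API in place of the stub).

```python
from typing import Callable

# Hypothetical trait labels for personality prompting.
TRAITS = ["highly agreeable", "highly disagreeable"]

def build_prompt(trait: str, offer: int, total: int = 10) -> str:
    """Compose a persona instruction plus an ultimatum-game scenario."""
    persona = f"You are a {trait} person. Answer in character."
    scenario = (
        f"Another player splits {total} dollars and offers you {offer}. "
        "Do you ACCEPT or REJECT the offer? Answer with one word."
    )
    return persona + "\n" + scenario

def evaluate(model: Callable[[str], str], offers=range(0, 11)) -> dict:
    """Record the accept rate per prompted trait across a sweep of offers."""
    results = {}
    for trait in TRAITS:
        accepts = sum(
            model(build_prompt(trait, o)).strip().upper() == "ACCEPT"
            for o in offers
        )
        results[trait] = accepts / len(list(offers))
    return results

def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call: accepts any offer of 3+ dollars regardless
    of the prompted persona, mimicking the paper's finding that behavior
    can be insensitive to the ascribed trait."""
    offer = int(prompt.split("offers you ")[1].split(".")[0])
    return "ACCEPT" if offer >= 3 else "REJECT"

print(evaluate(stub_model))
```

With the stub, both personas produce identical accept rates; the paper's point is that real models, too, often fail to separate along the prompted trait.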
📝 Abstract
The ongoing revolution in language modeling has led to various novel applications, some of which rely on the emerging social abilities of large language models (LLMs). Already, many users turn to these new cyber friends for advice during pivotal moments of their lives and trust them with their deepest secrets, which makes accurate shaping of an LLM's personality paramount. To this end, state-of-the-art approaches exploit a wide variety of training data and prompt the model to adopt a particular personality. We ask (i) whether personality-prompted models behave (i.e., make decisions when presented with a social situation) in line with the ascribed personality, and (ii) whether their behavior can be finely controlled. We use two classic psychological experiments, the Milgram experiment and the Ultimatum Game, as social interaction testbeds and apply personality prompting to open- and closed-source LLMs from four different vendors. Our experiments reveal failure modes of prompt-based modulation of the models' behavior that are shared across all models tested and persist under prompt perturbations. These findings challenge the optimistic sentiment toward personality prompting generally held in the community.