🤖 AI Summary
This study investigates how personality traits are encoded in large language models (LLMs) and how these representations are shaped by sociocultural factors, such as cultural norms and environmental pressures, within a social determinism framework. We propose a fine-tuning-free method that steers features in the latent space and use it to systematically link the Big Five personality dimensions with model safety risks, a link not previously examined, revealing implicit bias patterns between personality representations and harmful behaviors. By integrating psychological personality theory with LLM interpretability techniques, we localize and manipulate personality-related features in the latent space, enabling controllable personality transfer across scenarios. Our core contributions are: (1) establishing an analytical paradigm that couples personality and safety; (2) identifying latent features, amenable to intervention, where personality biases critically influence safety outcomes; and (3) providing a new pathway toward controllable personality modeling and safety alignment in LLMs.
📝 Abstract
Large language models (LLMs) have significantly advanced dialogue systems and role-playing agents through their ability to generate human-like text. While prior studies have shown that LLMs can exhibit distinct and consistent personalities, the mechanisms by which these models encode and express specific personality traits remain poorly understood. To address this gap, and guided by the theoretical framework of social determinism, we investigate how factors encoded within LLMs, such as cultural norms and environmental stressors, shape their personality traits. Inspired by related work on LLM interpretability, we propose a training-free approach that modifies the model's behavior by extracting the latent features corresponding to these factors and steering them, eliminating the need for retraining. Finally, we analyze the implications of these factors for model safety through the lens of personality.
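
To make the idea of training-free steering concrete, here is a minimal sketch of one common way to steer a latent feature: take the difference of mean activations between two contrasting prompt sets as a direction, then add a scaled copy of that direction to a hidden state at inference time. This is an illustrative assumption, not the paper's confirmed procedure; the abstract does not say how features are extracted. The activations are synthetic, and every name here (`steering_direction`, `steer`, `fake_activation`, `TRAIT_DIM`, `alpha`) is hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steering_direction(pos_acts, neg_acts):
    """Unit-norm difference between the mean activations of two prompt sets."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]


def steer(hidden, direction, alpha):
    """Shift one hidden state by alpha along the steering direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Synthetic stand-in for model activations (assumption: a real setup would
# record these from a chosen layer while running contrasting prompts).
random.seed(0)
D_MODEL = 8    # hypothetical hidden size
TRAIT_DIM = 3  # hypothetical coordinate that carries the factor


def fake_activation(offset):
    vec = [random.gauss(0.0, 1.0) for _ in range(D_MODEL)]
    vec[TRAIT_DIM] += offset
    return vec


pos_acts = [fake_activation(+2.0) for _ in range(200)]  # factor present
neg_acts = [fake_activation(-2.0) for _ in range(200)]  # factor absent
direction = steering_direction(pos_acts, neg_acts)

hidden = fake_activation(0.0)
steered = steer(hidden, direction, alpha=4.0)

# The projection onto the direction moves by alpha; no weights are updated.
shift = dot(steered, direction) - dot(hidden, direction)
```

Because `direction` has unit norm, `shift` equals `alpha` up to floating-point error, so the intervention strength is controlled by a single scalar and the model weights are left untouched.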