🤖 AI Summary
This work investigates whether safety alignment mechanisms in large language models (LLMs) fundamentally constrain their capacity to authentically embody morally ambiguous or malevolent roles. Addressing a critical gap in prior research—namely, the lack of systematic evaluation of antisocial role-playing—the authors introduce a four-level Moral Alignment Scale and Moral RolePlay, the first dedicated benchmark designed specifically to assess villainous character enactment. The framework integrates fine-grained personality trait analysis with a large-scale, multi-dimensional evaluation protocol. Experimental results demonstrate a systematic trade-off: stronger safety alignment consistently degrades model performance on adversarial traits such as deception and manipulation. This constitutes the first empirical evidence of an intrinsic tension between safety alignment and role-playing fidelity. The study thus provides both theoretical grounding and a quantifiable assessment toolkit for reconciling AI safety with creative, expressive applications.
📝 Abstract
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters ranging from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.