Too Good to be Bad: On the Failure of LLMs to Role-Play Villains

📅 2025-11-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether safety alignment mechanisms in large language models (LLMs) fundamentally constrain their capacity to authentically embody morally ambiguous or malevolent roles. Addressing a critical gap in prior research, namely the lack of systematic evaluation of antisocial role-playing, the authors introduce a four-level Moral Alignment Scale and the first dedicated benchmark, Moral RolePlay, designed specifically to assess villainous character enactment. The framework integrates fine-grained personality trait analysis with a large-scale, multi-dimensional evaluation protocol. Experimental results demonstrate a systematic trade-off: higher safety alignment consistently degrades model performance on adversarial traits such as deception and manipulation. This constitutes the first empirical evidence of an intrinsic tension between safety alignment and role-playing fidelity. The study thus provides both theoretical grounding and a quantifiable assessment toolkit for reconciling AI safety with creative, expressive applications.

📝 Abstract
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
Problem

Research questions and friction points this paper is trying to address.

LLMs struggle to authentically role-play morally ambiguous or villainous characters
Safety alignment creates fundamental conflict with portraying non-prosocial personas
Models show declining role-playing fidelity as character morality decreases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduced Moral RolePlay benchmark for evaluation
Tested LLMs on four-level moral alignment scale
Revealed safety alignment limits villain role-playing fidelity
👥 Authors
Zihao Yi (Tencent Multimodal Department)
Qingxuan Jiang (Graduate Student, MIT; Machine Learning, Optimization)
Ruotian Ma (Tencent Multimodal Department)
Xingyu Chen (Tencent Multimodal Department)
Qu Yang (National University of Singapore; Deep Learning, Spiking Neural Networks, Neuromorphic Computing)
Mengru Wang (Tencent Multimodal Department)
F. Ye (Tencent Multimodal Department)
Ying Shen (Sun Yat-Sen University)
Zhaopeng Tu (Tech Lead @ Tencent Digital Human; Digital Human, Agents, Large Language Models, Machine Translation)
Xiaolong Li (Tencent Multimodal Department)
Linus (Tencent Multimodal Department)