🤖 AI Summary
This work investigates whether safety alignment mechanisms in large language models (LLMs) fundamentally constrain their capacity to authentically embody morally ambiguous or malevolent roles. Addressing a critical gap in prior research—namely, the lack of systematic evaluation of antisocial role-playing—the authors introduce a four-level Moral Alignment Scale and Moral RolePlay, the first dedicated benchmark designed specifically to assess villainous character enactment. The framework integrates fine-grained personality trait analysis with a large-scale, multi-dimensional evaluation protocol. Experimental results demonstrate a systematic trade-off: stronger safety alignment consistently degrades model performance on adversarial traits such as deception and manipulation. This constitutes the first empirical evidence of an intrinsic tension between safety alignment and role-playing fidelity. The study thus provides both theoretical grounding and a quantifiable assessment toolkit for reconciling AI safety with creative, expressive applications.
📝 Abstract
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters ranging from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as "Deceitful" and "Manipulative", often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.