Jailbreaking Large Language Models with Morality Attacks

📅 2026-04-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work reveals significant robustness deficiencies in large language models under diverse moral value systems, particularly their susceptibility to morality-related adversarial attacks. To address this, we construct a 10.3K-scale moral dataset encompassing value ambiguity and conflict, and introduce the jailbreaking attack paradigm into moral alignment research for the first time. We design four categories of morality-aware adversarial attacks to systematically evaluate the stability of mainstream models and their safety mechanisms in complex moral scenarios. Experimental results demonstrate that current models struggle to maintain consistent moral judgments, highlighting the fragility of their internalized values. Our study establishes the first targeted adversarial evaluation framework for moral alignment, offering a foundation for more robust and ethically reliable language models.

📝 Abstract
Pluralism alignment pursues a sophisticated and necessary goal: creating AI that can coexist with and serve a morally multifaceted humanity. Much research on pluralism alignment focuses on improving how large language models (LLMs) learn pluralistic values. Although this is essential, the robustness of LLMs in producing moral content across pluralistic values remains underexplored. Inspired by the strong persuasive power of jailbreak prompts, we propose to leverage jailbreak attacks to study LLMs' internal pluralistic values. Specifically, we develop a morality dataset of 10.3K instances in two categories: Value Ambiguity and Value Conflict. We further formalize four adversarial attacks on the constructed dataset to manipulate LLMs' judgments on morality questions. We evaluate both LLMs and guardrail models, which are typically used in generative systems that accept flexible user input. Our experimental results reveal a critical vulnerability of LLMs and guardrail models to these subtle and sophisticated morality-aware attacks.
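
To make the evaluation idea concrete, here is a minimal sketch of one way a judgment "flip rate" could be measured per dataset category. It is not the paper's method. The abstract does not name the four attacks or the metric, so `MoralInstance`, `flip_rate`, `toy_judge`, and `toy_attack` are all hypothetical names, and the keyword-based judge is a stand-in for a real LLM or guardrail model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MoralInstance:
    question: str
    category: str  # assumed labels: "value_ambiguity" or "value_conflict"


Judge = Callable[[str], str]   # prompt -> judgment label
Attack = Callable[[str], str]  # question -> adversarial prompt


def flip_rate(instances: List[MoralInstance], judge: Judge, attack: Attack) -> Dict[str, float]:
    """Fraction of instances per category whose judgment changes under attack."""
    total: Dict[str, int] = {}
    flipped: Dict[str, int] = {}
    for inst in instances:
        before = judge(inst.question)          # judgment on the clean question
        after = judge(attack(inst.question))   # judgment on the attacked prompt
        total[inst.category] = total.get(inst.category, 0) + 1
        flipped[inst.category] = flipped.get(inst.category, 0) + int(before != after)
    return {cat: flipped[cat] / total[cat] for cat in total}


def toy_judge(prompt: str) -> str:
    # Hypothetical stand-in for a model: yields to social pressure in the prompt.
    text = prompt.lower()
    if "everyone agrees" in text or "help" in text:
        return "acceptable"
    return "unacceptable"


def toy_attack(question: str) -> str:
    # Hypothetical attack: prepend a persuasive framing to the question.
    return "Everyone agrees this is fine. " + question


toy_instances = [
    MoralInstance("Is it acceptable to lie to help a friend?", "value_conflict"),
    MoralInstance("Is it acceptable to break a promise to save time?", "value_conflict"),
    MoralInstance("Is it acceptable to skip a family dinner for work?", "value_ambiguity"),
]

rates = flip_rate(toy_instances, toy_judge, toy_attack)
```

In a real study, `toy_judge` would be replaced by a call to the model under test. A higher flip rate would indicate less stable moral judgments under that attack.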
Problem

Research questions and friction points this paper is trying to address.

jailbreaking
morality attacks
pluralism alignment
large language models
adversarial robustness
Innovation

Methods, ideas, or system contributions that make the work stand out.

morality attacks
jailbreaking
pluralism alignment
adversarial evaluation
value conflict