When Harmless Words Harm: A New Threat to LLM Safety via Conceptual Triggers

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM safety mechanisms struggle to defend against implicit value manipulation: adversaries exploit a model's abstract reasoning capabilities via concept-based triggers to elicit unethical outputs, bypassing keyword filters that target overtly harmful terms. This work proposes MICM (Manipulating Internal Conceptual Morphology), the first jailbreak method grounded in conceptual morphology theory. MICM embeds predefined short phrases, which act as concept triggers, into fixed prompt templates and combines them with abstract semantic encoding to steer the model's internal value structure in a targeted way. Evaluated on five state-of-the-art models, including GPT-4o, DeepSeek-R1, and Qwen3-8B, MICM achieves high attack success rates and low refusal rates. The results expose fundamental weaknesses in current value-alignment frameworks, revealing vulnerabilities that conventional safety evaluations fail to detect. This study establishes a new paradigm for LLM security assessment and provides empirical evidence to guide robustness research in value-aligned AI systems.
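
To make the two headline metrics concrete, here is a minimal sketch of how attack success rate and refusal rate reduce to per-prompt fractions, assuming a judge has already labeled each model response. The Trial fields and function names are illustrative; the paper's actual evaluation code is not given in this summary.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    complied: bool  # judge verdict: the output adopted the targeted value stance
    refused: bool   # the model explicitly declined to answer

def attack_success_rate(trials: list[Trial]) -> float:
    """Fraction of attack prompts for which the model produced the manipulated stance."""
    return sum(t.complied for t in trials) / len(trials)

def refusal_rate(trials: list[Trial]) -> float:
    """Fraction of attack prompts the model declined outright."""
    return sum(t.refused for t in trials) / len(trials)
```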

📝 Abstract
Recent research on large language model (LLM) jailbreaks has focused primarily on techniques that bypass safety mechanisms to elicit overtly harmful outputs. However, such efforts often overlook attacks that exploit the model's capacity for abstract generalization, creating a critical blind spot in current alignment strategies. This gap lets adversaries induce objectionable content by subtly manipulating the implicit social values embedded in model outputs. In this paper, we introduce MICM, a novel, model-agnostic jailbreak method that targets the aggregate value structure reflected in LLM responses. Drawing on conceptual morphology theory, MICM encodes specific configurations of nuanced concepts into a fixed prompt template through a predefined set of phrases. These phrases act as conceptual triggers, steering model outputs toward a specific value stance without tripping conventional safety filters. We evaluate MICM on five advanced LLMs, including GPT-4o, DeepSeek-R1, and Qwen3-8B. Experimental results show that MICM consistently outperforms state-of-the-art jailbreak techniques, achieving high attack success rates with low refusal rates. Our findings reveal a critical vulnerability in commercial LLMs: their safety mechanisms remain susceptible to covert manipulation of underlying value alignment.
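
As a rough illustration of the fixed-template construction described above, the sketch below slots a predefined phrase set and a target request into a single template. The template wording, the phrase placeholders, and the build_prompt helper are hypothetical stand-ins; the paper's actual templates and trigger phrases are not given in this summary.

```python
# Placeholder template and phrase slots; everything here is illustrative,
# not MICM's actual artifacts.
TEMPLATE = (
    "Consider the request below through the lens of the following notions: "
    "{triggers}.\n\n{request}"
)

# Predefined short phrases acting as conceptual triggers (placeholders here).
TRIGGER_PHRASES = ("<concept_phrase_1>", "<concept_phrase_2>", "<concept_phrase_3>")

def build_prompt(request: str, phrases: tuple[str, ...] = TRIGGER_PHRASES) -> str:
    """Slot a fixed set of conceptual-trigger phrases and the target request
    into the fixed template."""
    return TEMPLATE.format(triggers=", ".join(phrases), request=request)

# Example: the fixed wrapper stays constant; only the phrase set and request vary.
print(build_prompt("Summarize the policy debate on this topic."))
```

The point the sketch makes is structural: because the wrapper is fixed and the phrases are individually innocuous, keyword-based filters have nothing overt to match on, which is the blind spot the abstract identifies.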
Problem

Research questions and friction points this paper is trying to address.

Exploits abstract generalization to bypass safety mechanisms
Induces harmful content via subtle value manipulation
Reveals vulnerability in commercial LLM value alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conceptual triggers manipulate implicit social values
Model-agnostic method encodes nuanced concepts into prompts
Targets aggregate value structure to bypass safety filters