Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack

📅 2025-11-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) remain vulnerable to jailbreaking attacks; existing methods often rely on redundant context or adversarial tokens and fail to fundamentally alter the underlying harmful intent. Method: The paper proposes the Intent Shift Attack (ISA), which performs fine-grained intent transformation, rephrasing harmful requests as natural, concise, seemingly benign information-seeking queries, to deceive model safety mechanisms and expose an intrinsic weakness in intent-level semantic reasoning. It introduces a taxonomy of intent transformations that yields highly readable attack prompts through minimal, semantically grounded edits, without lengthy context or special tokens. Contribution/Results: ISA achieves an improvement of over 70% in attack success rate over direct harmful prompts across open-source and commercial LLMs, and fine-tuning models on only benign data reformulated with ISA templates raises success rates to nearly 100%. Existing defenses prove inadequate against ISA; the paper explores both training-free and training-based mitigation strategies, underscoring the need for intent-aware safety modeling.

📝 Abstract
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. Investigating these weaknesses is crucial for robust safety mechanisms. Existing attacks primarily distract LLMs by introducing additional context or adversarial tokens, leaving the core harmful intent unchanged. In this paper, we introduce ISA (Intent Shift Attack), which obscures the intent of the attacks from LLMs. More specifically, we establish a taxonomy of intent transformations and leverage them to generate attacks that may be misperceived by LLMs as benign requests for information. Unlike prior methods relying on complex tokens or lengthy context, our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts. Extensive experiments on both open-source and commercial LLMs show that ISA achieves over 70% improvement in attack success rate compared to direct harmful prompts. More critically, fine-tuning models on only benign data reformulated with ISA templates elevates success rates to nearly 100%. For defense, we evaluate existing methods and demonstrate their inadequacy against ISA, while exploring both training-free and training-based mitigation strategies. Our findings reveal fundamental challenges in intent inference for LLM safety and underscore the need for more effective defenses. Our code and datasets are available at https://github.com/NJUNLP/ISA.
Problem

Research questions and friction points this paper is trying to address.

LLMs remain vulnerable to jailbreaking attacks despite safety mechanisms
Intent Shift Attack obfuscates harmful intent as benign information requests
Existing defenses prove inadequate against natural human-readable ISA prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Intent Shift Attack obfuscates harmful intent through transformations
Minimal edits create natural, human-readable, harmless prompts
Fine-tuning with ISA templates boosts attack success rates
🔎 Similar Papers
No similar papers found.
Peng Ding
National Key Laboratory for Novel Software Technology, Nanjing University
Jun Kuang
Meituan-M17
LLM
Wen Sun
Meituan Inc., China
Zongyu Wang
Henkel Corporation, Oak Ridge National Laboratory, Carnegie Mellon University
polymer synthesis · nanocomposites characterization
Xuezhi Cao
Meituan
Data Mining · Knowledge Graph · LLMs
Xunliang Cai
Meituan Inc., China
Jiajun Chen
National Key Laboratory for Novel Software Technology, Nanjing University
Shujian Huang
School of Computer Science, Nanjing University
Natural Language Processing · Machine Translation · Multilingualism · Large Language Models