Attention Slipping: A Mechanistic Understanding of Jailbreak Attacks and Defenses in LLMs

📅 2025-07-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the lack of mechanistic understanding of jailbreak attacks against large language models (LLMs) and the high computational overhead of existing defenses, this paper identifies and empirically validates "Attention Slipping", a systematic attenuation of the attention weights assigned to harmful tokens during jailbreaking, as a commonality across diverse attack paradigms. Building on this insight, the authors propose "Attention Sharpening," a defense that recalibrates the attention score distribution via temperature scaling, strengthening safety responses without incurring additional computational or memory cost. Evaluated on four mainstream LLMs, the method significantly improves robustness against gradient-based search, prompt-template optimization, and in-context learning jailbreak attacks while preserving original task performance on AlpacaEval, outperforming baselines including Token Highlighter and SmoothLLM.

📝 Abstract
As large language models (LLMs) become more integral to society and technology, ensuring their safety becomes essential. Jailbreak attacks exploit vulnerabilities to bypass safety guardrails, posing a significant threat. However, the mechanisms enabling these attacks are not well understood. In this paper, we reveal a universal phenomenon that occurs during jailbreak attacks: Attention Slipping. During this phenomenon, the model gradually reduces the attention it allocates to unsafe requests in a user query during the attack process, ultimately causing a jailbreak. We show Attention Slipping is consistent across various jailbreak methods, including gradient-based token replacement, prompt-level template refinement, and in-context learning. Additionally, we evaluate two defenses based on query perturbation, Token Highlighter and SmoothLLM, and find they indirectly mitigate Attention Slipping, with their effectiveness positively correlated with the degree of mitigation achieved. Inspired by this finding, we propose Attention Sharpening, a new defense that directly counters Attention Slipping by sharpening the attention score distribution using temperature scaling. Experiments on four leading LLMs (Gemma2-9B-It, Llama3.1-8B-It, Qwen2.5-7B-It, Mistral-7B-It v0.2) show that our method effectively resists various jailbreak attacks while maintaining performance on benign tasks on AlpacaEval. Importantly, Attention Sharpening introduces no additional computational or memory overhead, making it an efficient and practical solution for real-world deployment.
Problem

Research questions and friction points this paper is trying to address.

Understanding mechanisms behind jailbreak attacks in LLMs
Evaluating defenses against Attention Slipping phenomenon
Proposing Attention Sharpening to mitigate jailbreak vulnerabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies Attention Slipping in jailbreak attacks
Proposes Attention Sharpening defense via temperature scaling
Ensures safety without computational overhead
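The core idea above, sharpening the attention distribution by applying a temperature below 1 inside the softmax, can be sketched as follows. This is a minimal illustration of temperature-scaled scaled-dot-product attention, not the paper's implementation; the function name, the exact placement of the temperature, and the choice of value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sharpened_attention(q, k, v, temperature=0.5):
    """Scaled dot-product attention with an extra temperature T.

    T < 1 divides the attention scores before the softmax, which
    concentrates (sharpens) the resulting weight distribution; T = 1
    recovers standard attention. A sketch of the 'Attention Sharpening'
    idea; the exact formulation in the paper may differ.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)           # standard scaled dot product
    weights = softmax(scores / temperature)  # T < 1 -> sharper distribution
    return weights @ v, weights
```

Because sharpening only rescales the pre-softmax scores, it adds no extra parameters, memory, or forward-pass cost, which matches the "no additional overhead" claim above.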