The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs

πŸ“… 2026-03-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Even after safety alignment, large language models remain vulnerable to continuation-triggered jailbreak attacks, yet the underlying mechanisms are poorly understood. This work provides the first mechanistic explanation at the attention level, revealing that such jailbreaks arise from an inherent conflict between the model's intrinsic drive to continue text and the safety constraints imposed by alignment training. Through causal intervention, activation scaling, and head-level interpretability analyses, we identify critical safety-related attention heads and systematically compare their behaviors across different model architectures. Our findings not only elucidate the root cause of jailbreak vulnerabilities but also offer theoretical insights and practical guidance for enhancing model robustness against such attacks.

πŸ“ Abstract
With the rapid advancement of large language models (LLMs), the safety of LLMs has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreak attacks. However, the root causes of such vulnerabilities are still poorly understood, making a rigorous investigation of jailbreak mechanisms a priority for both the academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, whereby simply relocating a continuation instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across different model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.
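The head-level causal interventions and activation scaling described in the abstract can be illustrated with a toy example. The sketch below is not the paper's actual implementation; it is a minimal NumPy model of a single multi-head self-attention layer, where a hypothetical `head_scale` parameter scales each head's output (1.0 leaves a head untouched, 0.0 ablates it), mimicking the kind of per-head intervention used to locate safety-critical heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads, head_scale=None):
    """Toy multi-head self-attention with optional per-head scaling.

    head_scale: optional array of shape (n_heads,); scaling a head's
    output toward 0 is a crude causal ablation of that head.
    """
    T, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        sl = slice(h * dh, (h + 1) * dh)
        att = softmax(Q[:, sl] @ K[:, sl].T / np.sqrt(dh))
        out = att @ V[:, sl]
        if head_scale is not None:
            out = head_scale[h] * out  # intervene on this head only
        heads.append(out)
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
d, n_heads, T = 8, 2, 4
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
X = rng.standard_normal((T, d))

baseline = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads)
# ablate head 1 and compare: the output shift measures that head's causal effect
ablated = multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads,
                               head_scale=np.array([1.0, 0.0]))
effect = np.abs(baseline - ablated).max()
```

In a real model, the analogous intervention hooks a specific layer's attention output and measures how the scaling shifts downstream behavior (e.g., refusal vs. continuation), which is how head-level importance is attributed.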
Problem

Research questions and friction points this paper is trying to address.

jailbreak
large language models
safety alignment
continuation-triggered
mechanistic interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

continuation-triggered jailbreak
mechanistic interpretability
attention heads
safety alignment
causal intervention
Yonghong Deng
School of Computer Science & Technology, Beijing Institute of Technology
Zhen Yang
School of Computer Science & Technology, Beijing Institute of Technology
Ping Jian
Beijing Institute of Technology
Natural Language Processing Β· Machine Learning
Xinyue Zhang
Southwest University of Science and Technology
Machine Learning Β· Multi-view clustering
Zhongbin Guo
Beijing Institute of Technology
Multimodal LLM
Chengzhi Li
School of Computer Science & Technology, Beijing Institute of Technology