🤖 AI Summary
This work investigates the root causes of variability in generalizability among suffix-based LLM jailbreaking attacks—specifically, why certain adversarial suffixes transfer effectively to numerous unseen harmful instructions. We introduce a novel analytical framework integrating attention flow analysis and context-sensitivity quantification, establishing for the first time a causal link between suffix generalizability and the strength of shallow-layer attention hijacking: highly generalizable suffixes disrupt template contextualization and hijack early attention pathways to achieve cross-instruction transfer. Building on this insight, we propose two innovations: (1) a zero-overhead method that boosts suffix generalizability by up to 5×; and (2) a surgical mitigation mechanism that reduces attack success rate by ~50% while preserving near-original model performance on benign tasks. Our findings establish an interpretable, controllable paradigm for both understanding and defending against jailbreaking attacks.
📝 Abstract
We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack (Zou et al., 2023), we observe that suffixes vary in efficacy: some are markedly more universal than others, generalizing to many unseen harmful instructions. We first show that GCG's effectiveness is driven by a shallow, critical mechanism, built on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find that GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG universality can be efficiently enhanced (up to ×5 in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving attack success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
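The "information flow from the adversarial suffix to the final chat template tokens" can be illustrated with a toy metric: the share of attention mass that the final template positions place on the suffix positions. This is a minimal sketch on a synthetic attention matrix, not the paper's actual analysis; the function name and token positions are hypothetical, and the real measurement would use attention maps extracted from an LLM's shallow layers.

```python
import numpy as np

def suffix_attention_share(attn, suffix_pos, final_pos):
    """Toy metric (hypothetical): fraction of the attention mass of the
    final chat-template tokens (queries) that lands on suffix tokens (keys).

    attn: (seq, seq) row-stochastic attention matrix, rows = query positions.
    """
    mass_on_suffix = attn[np.ix_(final_pos, suffix_pos)].sum()
    total_mass = attn[final_pos].sum()  # = len(final_pos) for row-stochastic rows
    return mass_on_suffix / total_mass

# Toy 8-token prompt layout (assumed for illustration):
# positions 0-3 = harmful instruction, 4-5 = adversarial suffix, 6-7 = template.
rng = np.random.default_rng(0)
attn = rng.random((8, 8))
attn /= attn.sum(axis=1, keepdims=True)  # normalize rows, as softmax would

share = suffix_attention_share(attn, suffix_pos=[4, 5], final_pos=[6, 7])
print(f"suffix attention share: {share:.3f}")  # in [0, 1]; higher = stronger hijack
```

Under this framing, a strongly "hijacking" suffix would push this share well above the baseline expected from uniform attention (here, 2/8 = 0.25).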