🤖 AI Summary
Existing attack methods on vision-language models suffer from slow convergence and low efficiency due to gradient conflicts with safety alignment mechanisms. This work proposes an adversarial attention hijacking approach that bypasses—rather than overrides—these safety mechanisms through a "push-pull" attention guidance strategy, causing the model to disregard safety instruction prefixes and instead focus on adversarial image features. The method reveals a novel "safety blindness" mechanism: successful attacks stem from the model’s inability to retrieve safety rules, not from deliberate rule violation. By integrating attention masking with adversarial feature anchoring, the approach effectively manipulates the model’s internal attention distribution. Evaluated on Qwen-VL, it achieves a 94.4% attack success rate (versus 68.8% for the baseline) with 40% fewer iterations; even under strict perturbation constraints (ε=8/255), it maintains a 59.0% success rate.
📝 Abstract
Large Vision-Language Models (LVLMs) rely on attention-based retrieval of safety instructions to maintain alignment during generation. Existing attacks typically optimize image perturbations to maximize harmful output likelihood, but suffer from slow convergence due to gradient conflict between adversarial objectives and the model's safety-retrieval mechanism. We propose Attention-Guided Visual Jailbreaking, which circumvents rather than overpowers safety alignment by directly manipulating attention patterns. Our method introduces two simple auxiliary objectives: (1) suppressing attention to alignment-relevant prefix tokens and (2) anchoring generation on adversarial image features. This simple yet effective push-pull formulation reduces gradient conflict by 45% and achieves 94.4% attack success rate on Qwen-VL (vs. 68.8% baseline) with 40% fewer iterations. At tighter perturbation budgets ($ε=8/255$), we maintain 59.0% ASR compared to 45.7% for standard methods. Mechanistic analysis reveals a failure mode we term safety blindness: successful attacks suppress system-prompt attention by 80%, causing models to generate harmful content not by overriding safety rules, but by failing to retrieve them.