🤖 AI Summary
Large vision-language models (LVLMs) exhibit opaque safety mechanisms under adversarial prompts, hindering reliable detection and mitigation of prompt-based attacks.
Method: This work is the first to identify and define intrinsic, sparse attention heads (termed "safety heads") that activate in a prompt-specific way during the first-token generation phase in response to malicious inputs. Leveraging this interpretable, early-stage activation pattern, we build a lightweight malicious prompt detector: it locates the safety heads, concatenates their activations across layers, and applies logistic regression for classification, requiring no fine-tuning of the LVLM. Despite its simple structure, the detector shows strong zero-shot generalization to unseen attacks.
Contribution/Results: Ablating the safety heads raises attack success rates while leaving the model's utility unaffected, confirming their role as specialized shields. Extensive experiments across diverse prompt-based attacks show that the detector effectively protects LVLMs while preserving original task performance and adding minimal extra inference overhead. The approach offers a localizable, reusable, and minimally intrusive handle on LVLM safety.
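The detection pipeline described above (locate safety heads, concatenate their first-token activations, fit a logistic regression) can be sketched as below. This is a minimal illustration on synthetic data, not the authors' implementation: the activation shapes, the head-scoring rule, and the number of heads kept are all assumptions for demonstration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: per-head activations captured at the first generated
# token, shaped [n_prompts, n_layers, n_heads, d_head]. In practice these
# would come from hooks on the LVLM's attention layers.
rng = np.random.default_rng(0)
n_prompts, n_layers, n_heads, d_head = 200, 8, 12, 16
activations = rng.normal(size=(n_prompts, n_layers, n_heads, d_head))
labels = rng.integers(0, 2, size=n_prompts)  # 1 = malicious prompt

# Plant a weak signal in one head (layer 3, head 5) so that the
# localization step has a "safety head" to find in this toy data.
activations[labels == 1, 3, 5, :] += 1.0

def head_score(layer: int, head: int) -> float:
    """Assumed scoring rule: how well this head's activation alone
    separates malicious from benign prompts under a linear probe."""
    X = activations[:, layer, head, :]
    return LogisticRegression(max_iter=1000).fit(X, labels).score(X, labels)

# 1) Locate safety heads: rank all heads by probe accuracy, keep a few.
scores = [(head_score(l, h), l, h)
          for l in range(n_layers) for h in range(n_heads)]
top_k = sorted(scores, reverse=True)[:3]  # sparsity: only a few heads matter

# 2) Concatenate the selected heads' activations and fit the detector.
X = np.concatenate([activations[:, l, h, :] for _, l, h in top_k], axis=1)
detector = LogisticRegression(max_iter=1000).fit(X, labels)
```

Because the detector only reads activations already produced during first-token generation and runs a single linear classifier, the extra inference cost is negligible, which matches the minimal-overhead claim above.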
📝 Abstract
With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during first-token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a straightforward but powerful malicious prompt detector that integrates seamlessly into the generation process with minimal extra inference overhead. Despite its simple structure of a logistic regression model, the detector surprisingly exhibits strong zero-shot generalization capabilities. Experiments across various prompt-based attacks confirm the effectiveness of leveraging safety heads to protect LVLMs. Code is available at https://github.com/Ziwei-Zheng/SAHs.