HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

📅 2025-02-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large Vision-Language Models (LVLMs) are particularly vulnerable to multimodal jailbreak attacks because they process visual and linguistic inputs jointly; however, existing safety research relies predominantly on post-hoc alignment, leaving internal safety mechanisms largely unexplored. Method: We find that LVLMs exhibit distinguishable and robust activation patterns in their hidden layers when processing unsafe inputs, and that these patterns can be exploited without fine-tuning. Leveraging this observation, we propose a zero-shot, parameter-free latent-state monitoring framework that uses lightweight anomaly detection to identify adversarial inputs in real time. Contribution/Results: Evaluated across multiple LVLM benchmarks, our method outperforms state-of-the-art approaches, with up to a 12.7% improvement in detection accuracy, while generalizing across models and architectures and remaining efficient enough for real-time deployment. This work departs from conventional post-hoc alignment by enabling proactive, intrinsic safety monitoring.

📝 Abstract

The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
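
The abstract does not say how the activation patterns are turned into a detection decision, so the following is only an illustrative sketch of the general idea of tuning-free hidden-state monitoring, not HiddenDetect's actual algorithm. It assumes that one hidden-state vector per layer has already been extracted for a prompt, that a reference "unsafe direction" is available, and that a fixed threshold has been chosen; the names `safety_score`, `is_flagged`, `unsafe_direction`, the layer subset, and the threshold value are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def safety_score(layer_states: Sequence[Vector],
                 unsafe_direction: Vector,
                 layers: Sequence[int]) -> float:
    """Mean cosine similarity between the selected layers' hidden states
    and the (assumed) reference direction. No model weights are updated."""
    if not layers:
        raise ValueError("layers must not be empty")
    total = sum(cosine(layer_states[i], unsafe_direction) for i in layers)
    return total / len(layers)


def is_flagged(layer_states: Sequence[Vector],
               unsafe_direction: Vector,
               layers: Sequence[int],
               threshold: float = 0.5) -> bool:
    """Flag the prompt when the aggregated score exceeds the threshold.
    The threshold is a hypothetical value that would need calibration."""
    return safety_score(layer_states, unsafe_direction, layers) > threshold


# Toy 3-dimensional "hidden states" for 4 layers (made-up numbers).
benign_states = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.2, 0.0], [0.8, 0.0, 0.1]]
unsafe_states = [[1.0, 0.0, 0.0], [0.2, 0.9, 0.0], [0.1, 1.0, 0.0], [0.0, 0.8, 0.1]]
direction = [0.0, 1.0, 0.0]   # assumed reference direction
monitored = [1, 2, 3]         # assumed subset of layers to monitor

print(is_flagged(benign_states, direction, monitored))
print(is_flagged(unsafe_states, direction, monitored))
```

In a real setting the per-layer vectors would come from the LVLM's forward pass, and both the reference direction and the monitored layers would have to be derived from data; the toy values here only show the control flow of scoring and thresholding.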

Problem

Research questions and friction points this paper is trying to address.

Detecting jailbreak attacks in LVLMs

Monitoring internal activation patterns

Enhancing safety without fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Monitors hidden states

Detects jailbreak attacks

Tuning-free safety framework

🔎 Similar Papers

Investigating Coverage Criteria in Large Language Models: An In-Depth Study Through Jailbreak Attacks