HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States

📅 2025-02-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) are particularly vulnerable to multimodal jailbreaking attacks due to their joint processing of visual and linguistic inputs; however, existing safety research relies predominantly on post-hoc alignment, leaving internal safety mechanisms unexplored. Method: We discover, for the first time, that LVLMs exhibit distinguishable and robust activation patterns in hidden layers when processing unsafe inputs—patterns that are exploitable without fine-tuning. Leveraging this observation, we propose a zero-shot, parameter-free latent-state monitoring framework that employs lightweight anomaly detection to identify adversarial inputs in real time. Contribution/Results: Evaluated across multiple LVLM benchmarks, our method significantly outperforms state-of-the-art approaches—achieving up to a 12.7% improvement in detection accuracy—while demonstrating strong generalization across models and architectures and enabling efficient real-time deployment. This work breaks from conventional post-processing alignment paradigms by enabling proactive, intrinsic safety monitoring.
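The summary above describes reading out hidden-layer activations and applying lightweight anomaly detection to flag unsafe inputs. A minimal NumPy sketch of that general idea, using a difference-of-means "refusal direction" over calibration activations as a stand-in detector (the function names, scoring rule, and threshold are illustrative assumptions, not the paper's actual method):

```python
import numpy as np

def refusal_direction(safe_acts, unsafe_acts):
    """Unit vector from the mean of safe-prompt activations toward
    the mean of unsafe-prompt activations (calibration step)."""
    d = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def safety_score(hidden_states, direction):
    """Mean projection of per-layer hidden states onto the direction;
    higher scores indicate activations closer to the unsafe pattern."""
    return float(np.mean(hidden_states @ direction))

def is_jailbreak(hidden_states, direction, threshold=0.0):
    """Flag an input whose activations score above the threshold."""
    return safety_score(hidden_states, direction) > threshold
```

In practice the calibration sets would come from a small labeled pool of safe and unsafe prompts, with `hidden_states` extracted from the model's intermediate layers at inference time; the tuning-free appeal is that no model weights are updated.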

📝 Abstract
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
Problem

Research questions and friction points this paper is trying to address.

Detecting jailbreak attacks in LVLMs
Monitoring internal activation patterns
Enhancing safety without fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Monitors hidden states
Detects jailbreak attacks
Tuning-free safety framework
Yilei Jiang
MMLab, The Chinese University of Hong Kong
Xinyan Gao
MMLab, The Chinese University of Hong Kong
Tianshuo Peng
MMLab, The Chinese University of Hong Kong
Yingshui Tan
Future Lab, Alibaba Group
Xiaoyong Zhu
Jiangsu University
Electrical Machines · Electrical Vehicle
Bo Zheng
Future Lab, Alibaba Group
Xiangyu Yue
The Chinese University of Hong Kong / UC Berkeley / Stanford University / NJU
Artificial Intelligence · Computer Vision · Multi-modal Learning