Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models

📅 2025-10-17
🤖 AI Summary
Large Vision-Language Models (LVLMs) are vulnerable to jailbreak attacks, yet existing detection methods suffer from poor generalizability or reliance on inefficient heuristic rules. To address this, we propose LoD, a novel unsupervised framework for efficient detection of *unseen* jailbreak attacks. LoD first constructs Multimodal Safety Concept Activation Vectors (SCAVs) to explicitly model cross-modal safety semantics; it then introduces the Safety Pattern Auto-Encoder (SPAE), which learns the implicit safety distribution underlying benign multimodal interactions and flags anomalous deviations from it. Crucially, LoD requires no attack samples for training, introduces zero additional parameters, and achieves both strong generalizability and high computational efficiency. Evaluated across diverse unseen jailbreak attacks, LoD improves average AUROC by 12.7% over state-of-the-art baselines while incurring only a 3.2% increase in inference latency, demonstrating significant gains in both detection accuracy and practical deployability.

📝 Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
Problem

Research questions and friction points this paper is trying to address.

Detecting unknown jailbreak attacks in large vision-language models
Overcoming limitations of attack-specific detection methods
Improving accuracy and efficiency for unseen attack detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shifts focus from attack-specific to task-specific learning
Uses multimodal safety vectors for representation learning
Employs autoencoder for unsupervised attack classification
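The unsupervised detection idea in the bullets above can be illustrated with a toy sketch: fit an auto-encoder on benign representation vectors only, then flag inputs whose reconstruction error falls outside the benign-error distribution. This is a generic anomaly-detection sketch under assumed synthetic data, not LoD's actual architecture; the `score` helper, the tied-weight linear auto-encoder, and all dimensions are illustrative stand-ins for the paper's safety-oriented activation vectors.

```python
# Hedged sketch (assumptions throughout): train a tiny auto-encoder on
# "benign" vectors only, then flag inputs whose reconstruction error
# exceeds a high quantile of the benign errors. In LoD the inputs would be
# safety-oriented LVLM activations (SCAVs); here they are synthetic.
import numpy as np

rng = np.random.default_rng(0)

d, k = 32, 8                                   # feature dim, bottleneck dim
benign = rng.normal(0.0, 1.0, size=(500, d))   # stand-in benign activations
attack = rng.normal(3.0, 1.0, size=(50, d))    # shifted "attack" distribution

# Tied-weight linear auto-encoder: encode with x @ W, decode with @ W.T.
W = rng.normal(0.0, 0.1, size=(d, k))
lr = 0.01
for _ in range(300):
    resid = benign @ W @ W.T - benign          # reconstruction residual
    grad = 2.0 * (benign.T @ resid @ W + resid.T @ benign @ W) / len(benign)
    W -= lr * grad                             # gradient step on squared error

def score(x):
    """Reconstruction error; large values suggest out-of-distribution input."""
    return np.linalg.norm(x - x @ W @ W.T, axis=1)

# Threshold set from benign scores only: no attack samples are ever used.
threshold = np.quantile(score(benign), 0.95)
detected = (score(attack) > threshold).mean()
print(f"fraction of attack inputs flagged: {detected:.2f}")
```

The key property mirrored here is that training and thresholding use benign data alone, so the detector needs no examples of any particular attack, which is what lets this style of method generalize to unseen attacks.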
Shuang Liang
Renmin University of China
Zhihao Xu
Renmin University of China
Jialing Tao
Alibaba
Hui Xue
Alibaba Group
Xiting Wang
Associate Professor, Renmin University of China
Explainable AI · AI Alignment · Visual Analytics · Trustworthy AI · Reasoning