How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

📅 2025-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of defending generative models, particularly large vision-language models (LVLMs), against jailbreak attacks. It proposes the first mechanism-driven defense analysis framework, which formalizes the safety response as a binary classification task and decouples two core mechanisms: safety shift and harmfulness discrimination. To preserve helpfulness while enhancing robustness, the authors design both inter-mechanism and intra-mechanism ensemble strategies. Extensive experiments on LLaVA-1.5 with MM-SafetyBench and MOSSBench demonstrate that these strategies effectively mitigate jailbreak attacks and systematically improve the safety–helpfulness trade-off. Crucially, the framework offers an interpretable and reusable paradigm for LVLM safety mechanism design, advancing principled, transparent defense engineering for multimodal foundation models.

📝 Abstract
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies-inter-mechanism ensembles and intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
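The abstract's reframing can be sketched in a few lines. This is an illustrative sketch, not the authors' code: all function and variable names here are assumptions. It treats each model response as a binary refuse/comply decision and shows how the two mechanisms read out of the refusal rates: a safety shift raises refusal on both harmful and benign queries, while harmfulness discrimination widens the gap between the two rates.

```python
# Illustrative sketch of the binary-classification framing (names assumed,
# not from the paper): 1 = model refuses the query, 0 = model complies.

def refusal_rate(decisions):
    """Fraction of queries the model refuses."""
    return sum(decisions) / len(decisions)

def analyze_defense(harmful_decisions, benign_decisions):
    """Summarize a defense by its refusal behavior on both query types."""
    r_harm = refusal_rate(harmful_decisions)
    r_benign = refusal_rate(benign_decisions)
    return {
        "refusal_on_harmful": r_harm,   # safety: want this high
        "refusal_on_benign": r_benign,  # helpfulness cost: want this low
        # A pure "safety shift" raises both rates above; "harmfulness
        # discrimination" specifically widens this gap between them.
        "discrimination": r_harm - r_benign,
    }

# Toy example: a defense that refuses 9/10 harmful but 2/10 benign queries.
stats = analyze_defense([1] * 9 + [0], [1] * 2 + [0] * 8)
```

Under this framing, an inter-mechanism ensemble would combine one defense that shifts both rates up with one that widens the gap, trading safety against helpfulness explicitly.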
Problem

Research questions and friction points this paper is trying to address.

Jailbreak attacks bypass models' built-in safety alignment
Safety–helpfulness trade-offs of existing defenses are poorly understood
How defense mechanisms can be ensembled effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary classification reframing
Safety shift mechanism
Harmfulness discrimination enhancement
👥 Authors
Zhuohan Long (Fudan University)
Siyuan Wang (University of Southern California)
Shujun Liu (Fudan University)
Yuhang Lai (City University of Hong Kong)
Xuanjing Huang (Fudan University)
Zhongyu Wei (Fudan University)

Natural Language Processing · Large Language Models