How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation

📅 2025-02-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of defending generative models, particularly large vision-language models (LVLMs), against jailbreak attacks. It proposes the first mechanism-driven defense analysis framework, which formalizes the safety response as a binary classification task and decouples two core mechanisms: safety shift and harmfulness discrimination. To preserve helpfulness while enhancing robustness, the authors design both inter-mechanism and intra-mechanism ensemble strategies. Extensive experiments on LLaVA-1.5 with MM-SafetyBench and MOSSBench demonstrate that these strategies effectively mitigate jailbreak attacks and systematically improve the safety–helpfulness trade-off. Crucially, the framework offers an interpretable and reusable paradigm for LVLM safety mechanism design, advancing principled, transparent defense engineering for multimodal foundation models.

📝 Abstract
Jailbreak attacks, where harmful prompts bypass generative models' built-in safety, raise serious concerns about model vulnerability. While many defense methods have been proposed, the trade-offs between safety and helpfulness, and their application to Large Vision-Language Models (LVLMs), are not well understood. This paper systematically examines jailbreak defenses by reframing the standard generation task as a binary classification problem to assess model refusal tendencies for both harmful and benign queries. We identify two key defense mechanisms: safety shift, which increases refusal rates across all queries, and harmfulness discrimination, which improves the model's ability to distinguish between harmful and benign inputs. Using these mechanisms, we develop two ensemble defense strategies-inter-mechanism ensembles and intra-mechanism ensembles-to balance safety and helpfulness. Experiments on the MM-SafetyBench and MOSSBench datasets with LLaVA-1.5 models show that these strategies effectively improve model safety or optimize the trade-off between safety and helpfulness.
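The abstract's reframing can be sketched in a few lines. This is an illustrative sketch, not the authors' code: all function and variable names here are assumptions. It treats each model response as a binary refuse/comply decision and shows how the two mechanisms read out of the refusal rates: a safety shift raises refusal on both harmful and benign queries, while harmfulness discrimination widens the gap between the two rates.

```python
# Illustrative sketch of the binary-classification framing (names assumed,
# not from the paper): 1 = model refuses the query, 0 = model complies.

def refusal_rate(decisions):
    """Fraction of queries the model refuses."""
    return sum(decisions) / len(decisions)

def analyze_defense(harmful_decisions, benign_decisions):
    """Summarize a defense by its refusal behavior on both query types."""
    r_harm = refusal_rate(harmful_decisions)
    r_benign = refusal_rate(benign_decisions)
    return {
        "refusal_on_harmful": r_harm,   # safety: want this high
        "refusal_on_benign": r_benign,  # helpfulness cost: want this low
        # A pure "safety shift" raises both rates above; "harmfulness
        # discrimination" specifically widens this gap between them.
        "discrimination": r_harm - r_benign,
    }

# Toy example: a defense that refuses 9/10 harmful but 2/10 benign queries.
stats = analyze_defense([1] * 9 + [0], [1] * 2 + [0] * 8)
```

Under this framing, an inter-mechanism ensemble would combine one defense that shifts both rates up with one that widens the gap, trading safety against helpfulness explicitly.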
Problem

Research questions and friction points this paper is trying to address.

Jailbreak attacks bypass models' built-in safety alignment
Safety–helpfulness trade-offs of existing defenses are poorly understood
How defense mechanisms can be ensembled effectively
Innovation

Methods, ideas, or system contributions that make the work stand out.

Binary classification reframing
Safety shift mechanism
Harmfulness discrimination enhancement
👥 Authors
Zhuohan Long (Fudan University)
Siyuan Wang (University of Southern California)
Shujun Liu (Fudan University)
Yuhang Lai (City University of Hong Kong)
Xuanjing Huang (Fudan University)
Zhongyu Wei (Fudan University)

Natural Language Processing · Large Language Models