🤖 AI Summary
Vision-language models (VLMs) exhibit significantly degraded safety on image inputs compared to text-only settings, a degradation linked to the modality gap between image and text representations that arises during pretraining. This work first quantitatively establishes a strong correlation: the larger the modality gap, the higher the unsafe response rate. Building on this finding, it proposes a pretraining-stage modality-alignment regularization that uses contrastive learning to narrow the cross-modal semantic gap. Evaluated on mainstream VLMs (LLaVA-v1.5, ShareGPT4V, and MiniGPT-4), the approach reduces the unsafe response rate by up to 16.3%. It is also compatible with existing safety mitigation techniques, yielding a further improvement of up to 18.2% when combined. Crucially, these gains come without compromising model capability: safety improves while multimodal understanding performance is preserved.
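The contrastive-alignment idea can be sketched as a symmetric InfoNCE loss over paired image/text embeddings. The function below is a minimal NumPy illustration of that general technique, not the paper's exact regularizer; the temperature value, the symmetric two-direction form, and the function name are assumptions.

```python
import numpy as np

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss that pulls paired image/text embeddings
    together and pushes mismatched pairs apart (hypothetical sketch).

    img_emb, txt_emb: (N, D) arrays where row i of each is a matched pair.
    Returns a scalar loss; lower means tighter cross-modal alignment.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def cross_entropy_diag(l):
        # Cross-entropy where the correct "class" for row i is column i.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

During pretraining, a term like this would be added to the usual captioning objective so that the projector is penalized for leaving image embeddings far from their paired text embeddings.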
📝 Abstract
Ensuring that Large Vision-Language Models (LVLMs) generate safe outputs is crucial for their reliable deployment. However, LVLMs suffer from drastic safety degradation compared to their LLM backbones: even blank or irrelevant images can trigger LVLMs into generating harmful responses to prompts that would be refused in text-only contexts. The modality gap between image and text representations has recently been hypothesized to contribute to this safety degradation, but whether and how the size of the gap affects LVLM safety has not been studied. In this work, we show that the size of the modality gap is strongly inversely correlated with LVLM safety. We then show that this gap is introduced during LVLM pretraining and persists through fine-tuning. Motivated by these observations, we propose a regularization that reduces the modality gap during pretraining. Our extensive experiments on LLaVA-v1.5, ShareGPT4V, and MiniGPT-4 show that our method substantially improves the safety alignment of LVLMs, reducing the unsafe rate by up to 16.3% without compromising performance, and can further boost existing defenses by up to 18.2%.
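To correlate the gap with safety outcomes, one first needs a scalar measure of it. A common choice in prior work on the modality gap is the Euclidean distance between the centroids of the L2-normalized image and text embeddings; the snippet below is a minimal sketch under that assumption and is not necessarily the exact measure used here.

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    """Centroid distance between modalities (assumed gap measure).

    img_emb, txt_emb: (N, D) arrays of image and text embeddings.
    Embeddings are L2-normalized first, so the gap reflects direction,
    not embedding scale. Returns a non-negative scalar; 0 means the
    two modalities share the same centroid on the unit sphere.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))
```

Computing this statistic over a held-out set of paired image/caption embeddings, before and after fine-tuning, is one way to check the claim that the gap arises in pretraining and persists afterward.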