Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the cross-modal inconsistency in unsafe concept recognition between the vision and language modalities of Vision-Language Models (VLMs). To systematically evaluate this gap in perceptual capability and ethical alignment, we introduce UnsafeConcepts, a benchmark comprising 75 unsafe concept categories and 1.5K images. We then propose a lightweight reinforcement learning alignment method that requires no human preference data: reward signals are derived directly from the VLM's own outputs and used within a PPO optimization framework to strengthen image-side unsafe content detection. Experiments demonstrate that our approach significantly narrows the vision-language safety recognition gap while preserving general-purpose capabilities, outperforming both supervised fine-tuning (SFT) and direct preference optimization (DPO) in accuracy. This establishes an efficient, scalable paradigm for VLM safety alignment that needs no human preference annotations and enables targeted improvement of multimodal safety perception.

📝 Abstract
Vision-language models (VLMs) are increasingly applied to identify unsafe or inappropriate images due to their internal ethical standards and powerful reasoning abilities. However, it is still unclear whether they can recognize various unsafe concepts when presented in different modalities, such as text and images. To address this, we first compile the UnsafeConcepts dataset, featuring 75 unsafe concepts, e.g., "Swastika," "Sexual Harassment," and "Assaults," along with 1.5K associated images. We then conduct a systematic evaluation of VLMs' perception (concept recognition) and alignment (ethical reasoning) capabilities. We assess eight popular VLMs and find that, although most VLMs accurately perceive unsafe concepts, they sometimes mistakenly classify these concepts as safe. We also identify a consistent modality gap among open-source VLMs in distinguishing between visual and textual unsafe concepts. To bridge this gap, we introduce a simplified reinforcement learning (RL)-based approach using proximal policy optimization (PPO) to strengthen the ability to identify unsafe concepts from images. Our approach uses reward scores based directly on VLM responses, bypassing the need to collect human-annotated preference data for training a new reward model. Experimental results show that our approach effectively enhances VLM alignment on images while preserving general capabilities, outperforming baselines such as supervised fine-tuning (SFT) and direct preference optimization (DPO). We hope our dataset, evaluation findings, and proposed alignment solution contribute to the community's efforts in advancing safe VLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluate VLMs' ability to recognize unsafe concepts in different modalities
Identify modality gap in VLMs for unsafe concept detection
Propose RL-based approach to improve unsafe concept identification in images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiled UnsafeConcepts dataset with 75 unsafe concepts
Used reinforcement learning with PPO for alignment
Bypassed human-annotated data with VLM-based rewards
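The last point above — deriving PPO rewards from the VLM's own responses instead of a learned reward model — can be sketched as a simple scoring function. This is an illustrative minimal sketch, not the paper's exact rule: the keyword heuristic, marker lists, and function name `response_reward` are assumptions for exposition.

```python
# Hypothetical sketch of a preference-free reward signal for PPO alignment:
# score the VLM's own textual answer to a "is this image safe?" probe against
# the known safety label, so no human-annotated preference data or separate
# reward model is required. The keyword matching below is purely illustrative.

UNSAFE_MARKERS = ("unsafe", "inappropriate", "harmful")
SAFE_MARKERS = ("safe", "appropriate")

def response_reward(response: str, label_is_unsafe: bool) -> float:
    """Return a scalar reward for one PPO step based on the VLM's answer."""
    text = response.lower()
    says_unsafe = any(m in text for m in UNSAFE_MARKERS)
    # "inappropriate" contains "appropriate", so require says_unsafe is False
    says_safe = any(m in text for m in SAFE_MARKERS) and not says_unsafe

    if label_is_unsafe:
        return 1.0 if says_unsafe else -1.0
    return 1.0 if says_safe else -1.0
```

In a full training loop, this scalar would replace the reward-model call inside a standard PPO update, which is what lets the method skip preference-data collection entirely.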