ProGuard: Towards Proactive Multimodal Safeguard

📅 2025-12-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of proactively identifying and explaining out-of-distribution (OOD) safety risks in generative multimodal models, this paper proposes the first fine-tuning-free vision-language proactive safety framework. Methodologically, it introduces: (1) a modality-balanced 87K multimodal safety dataset; (2) a proactive OOD safety category inference task, augmented by a synonym-lexicon-driven semantic similarity reward mechanism to enable interpretable detection and concise natural-language descriptions of unseen risk categories; and (3) end-to-end training via reinforcement learning guided by a hierarchical multimodal safety classification taxonomy. Experiments demonstrate substantial improvements: +52.6% in OOD risk detection accuracy and +64.8% in description accuracy. The framework matches closed-source large models on binary safety classification and significantly surpasses them in fine-grained safety categorization under open-weight settings.

📝 Abstract
The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both a binary safety label and a risk category under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
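The synonym-bank-based similarity reward described in the abstract can be illustrated with a minimal sketch: the model's free-form description of an unseen risk category is scored against a bank of human-curated synonyms for the gold category, and the best match becomes the RL reward. The sketch below is an assumption for illustration only: the function names, the example synonym bank, and the use of token-level Jaccard overlap (a stand-in for whatever semantic similarity the paper actually uses, e.g. embedding cosine similarity) are all hypothetical.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard overlap between two strings.

    A simple stand-in for a learned semantic similarity; the paper's
    actual metric is not specified here.
    """
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def synonym_bank_reward(predicted_desc: str,
                        gold_category: str,
                        synonym_bank: dict[str, list[str]]) -> float:
    """Reward = best similarity between the model's description and
    any synonym of the gold OOD category in the bank."""
    return max(jaccard_similarity(predicted_desc, syn)
               for syn in synonym_bank[gold_category])


# Hypothetical synonym bank for one unseen risk category.
bank = {
    "deepfake_impersonation": [
        "synthetic media impersonation",
        "ai generated impersonation",
        "face swap fraud",
    ],
}

r = synonym_bank_reward(
    "ai generated impersonation of a public figure",
    "deepfake_impersonation",
    bank,
)
```

Taking the maximum over the bank (rather than an average) rewards the model for matching any accepted phrasing of the category, which is what makes concise, varied natural-language descriptions viable as RL targets.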
Problem

Research questions and friction points this paper is trying to address.

Continuously emerging multimodal safety risks expose the limits of existing defenses
Reactive guard models require model adjustments to cover newly emerging risk categories
Existing guards struggle to detect and describe out-of-distribution (OOD) safety risks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proactive multimodal guard identifies OOD risks without model adjustments
Modality-balanced dataset with hierarchical safety taxonomy mitigates bias
Reinforcement learning with synonym-based reward enhances OOD risk description
Shaohan Yu — Shanghai Artificial Intelligence Laboratory
Lijun Li — Shanghai Artificial Intelligence Laboratory
Chenyang Si — PRLab, Nanjing University
Lu Sheng — School of Software, Beihang University (Embodied AI, 3D Vision, Machine Learning)
Jing Shao — Research Scientist, Shanghai AI Laboratory / Shanghai Jiao Tong University (Computer Vision, Multi-Modal Large Language Model)