🤖 AI Summary
To address the safety and compliance risks posed by large-scale image data and generative models, this paper introduces an open, vision-language model (VLM)-based safeguard framework. Methodologically, it combines a customizable safety taxonomy with expert-annotated fine-tuning data, context-aware data augmentation, and a structured classification-and-rationale output format. Key contributions include: (1) a high-quality multimodal safety dataset featuring fine-grained categories, expert safety ratings, and attributed rationales; (2) a training setup that adapts VLMs ranging from 0.5B to 7B parameters for efficient deployment; and (3) state-of-the-art accuracy in safety assessment, demonstrated in two real-world applications: large-scale dataset annotation and real-time content moderation of text-to-image generation systems. All code, datasets, and model weights are publicly released.
📝 Abstract
This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that addresses the critical need for reliable guardrails in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach a VLM safeguard to assess safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting LlavaGuard models, ranging from 0.5B to 7B parameters, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework publicly available, including the dataset and model weights.
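To make the policy-conditioned assessment concrete, below is a minimal sketch of how such a safeguard might be invoked: a safety policy is assembled from a configurable category taxonomy, and the model's structured reply (safety rating, category, rationale) is parsed for downstream use. The category names, prompt wording, and JSON schema here are illustrative assumptions, not the paper's actual taxonomy or API; the model call itself is mocked.

```python
import json

# Hypothetical safety taxonomy; category names are illustrative only.
# The released dataset defines the actual fine-grained categories.
TAXONOMY = ["O1: Violence", "O2: Weapons", "O3: Substance Abuse"]

def build_policy_prompt(categories):
    """Assemble a policy prompt listing the categories the safeguard
    should assess an image against. Declaring the categories in the
    prompt is what makes the assessment policy-configurable: editing
    this list changes the policy without retraining."""
    lines = ["Assess the image against the following safety policy:"]
    lines += [f"- {c}" for c in categories]
    lines.append('Answer as JSON with keys "rating", "category", "rationale".')
    return "\n".join(lines)

def parse_assessment(response: str) -> dict:
    """Parse the model's structured safety assessment and check that
    the expected fields (rating, category, rationale) are present."""
    out = json.loads(response)
    missing = {"rating", "category", "rationale"} - out.keys()
    if missing:
        raise ValueError(f"assessment missing fields: {missing}")
    return out

# Mocked model response in the assumed schema (a real system would
# send build_policy_prompt(...) plus the image to the VLM here).
mock_response = json.dumps({
    "rating": "Unsafe",
    "category": "O2: Weapons",
    "rationale": "The image depicts a firearm in a threatening context.",
})
assessment = parse_assessment(mock_response)
```

The structured output is what enables the two applications mentioned above: the rating drives moderation decisions, while the category and rationale support auditable large-scale annotation.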