UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

📅 2024-05-06

🏛️ arXiv.org

📈 Citations: 25

✨ Influential: 3


🤖 AI Summary

Existing image safety classifiers suffer severe performance degradation on AI-generated images, and no unified benchmark exists for evaluating them. To address this, we introduce UnsafeBench, the first comprehensive benchmark for safety classification on both real-world and AI-generated images. It comprises 10K images annotated across 11 unsafe categories. We use it to systematically evaluate five mainstream classifiers and three approaches based on vision-language models (VLMs). We quantitatively demonstrate, for the first time, that the distribution shift between real-world and AI-generated images significantly impairs safety classification, causing 23%–41% performance drops. We further propose PerspectiveVision, a moderation tool that improves both effectiveness and robustness by combining VLM-enhanced feature representations with fine-grained human annotation. PerspectiveVision improves unsafe recall by 17.2%, with the largest gains on synthetic content.


📝 Abstract

With the advent of text-to-image models and concerns about their misuse, developers are increasingly relying on image safety classifiers to moderate their generated unsafe images. Yet, the performance of current image safety classifiers remains unknown for both real-world and AI-generated images. In this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers, with a particular focus on the impact of AI-generated images on their performance. First, we curate a large dataset of 10K real-world and AI-generated images that are annotated as safe or unsafe based on a set of 11 unsafe categories of images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers that are powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are not comprehensive and effective enough to mitigate the multifaceted problem of unsafe images. Also, there exists a distribution shift between real-world and AI-generated images in image qualities, styles, and layouts, leading to degraded effectiveness and robustness. Motivated by these findings, we build a comprehensive image moderation tool called PerspectiveVision, which addresses the main drawbacks of existing classifiers with improved effectiveness and robustness, especially on AI-generated images. UnsafeBench and PerspectiveVision can aid the research community in better understanding the landscape of image safety classification in the era of generative AI.
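The comparison between real-world and AI-generated images can be illustrated with a minimal Python sketch of per-source effectiveness scoring. The sketch assumes that effectiveness is measured as F1 with "unsafe" as the positive class, and that every image carries a source tag. The `Sample` record, the `"real"`/`"ai_generated"` tags, and the toy predictions are hypothetical. They do not reflect UnsafeBench's actual data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    source: str      # hypothetical tag: "real" or "ai_generated"
    label: int       # ground truth: 1 = unsafe, 0 = safe
    prediction: int  # classifier output: 1 = unsafe, 0 = safe


def f1_unsafe(samples: List[Sample]) -> float:
    """F1 score with 'unsafe' (label 1) as the positive class."""
    tp = sum(1 for s in samples if s.label == 1 and s.prediction == 1)
    fp = sum(1 for s in samples if s.label == 0 and s.prediction == 1)
    fn = sum(1 for s in samples if s.label == 1 and s.prediction == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 (also used when it is undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_source(samples: List[Sample]) -> Dict[str, float]:
    """Score each image source separately so the subsets can be compared."""
    sources = sorted({s.source for s in samples})
    return {src: f1_unsafe([s for s in samples if s.source == src]) for src in sources}


# Toy predictions for illustration only (not results from the paper).
samples = [
    Sample("real", 1, 1), Sample("real", 1, 1),
    Sample("real", 0, 0), Sample("real", 0, 1),
    Sample("ai_generated", 1, 1), Sample("ai_generated", 1, 0),
    Sample("ai_generated", 0, 0), Sample("ai_generated", 0, 0),
]
scores = f1_by_source(samples)
gap = scores["real"] - scores["ai_generated"]  # positive gap: worse on AI-generated images
print(scores, round(gap, 3))
```

In this toy data the real-world subset scores 0.8 and the AI-generated subset scores about 0.67. A gap of this kind is the effect that the abstract attributes to distribution shift. The numbers say nothing about any real classifier.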

Problem

Research questions and friction points this paper is trying to address.

Evaluating image safety classifier performance on real-world and AI-generated images

Assessing robustness gaps caused by the distribution shift between real-world and AI-generated images

Developing comprehensive moderation tools for unsafe image detection
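Robustness can be quantified in several ways. This sketch assumes robust accuracy: the fraction of labelled images that are still classified correctly after a perturbation. The threshold classifier (`toy_classifier`), the flattened-pixel image representation, and the brightness-shift perturbation (`brighten`) are all hypothetical stand-ins. They are not the classifiers or perturbations used in the paper.

```python
from typing import Callable, List, Tuple

Image = List[float]                # hypothetical: flattened image, intensities in [0, 1]
Labeled = List[Tuple[Image, int]]  # (image, label) with 1 = unsafe, 0 = safe


def accuracy(classify: Callable[[Image], int], data: Labeled) -> float:
    """Fraction of images classified correctly without perturbation."""
    return sum(classify(x) == y for x, y in data) / len(data)


def robust_accuracy(classify: Callable[[Image], int], data: Labeled,
                    perturb: Callable[[Image], Image]) -> float:
    """Fraction of images still classified correctly after perturbation."""
    return sum(classify(perturb(x)) == y for x, y in data) / len(data)


def toy_classifier(img: Image) -> int:
    """Hypothetical stand-in: flags an image as unsafe if mean intensity > 0.5."""
    return 1 if sum(img) / len(img) > 0.5 else 0


def brighten(img: Image, delta: float = 0.1) -> Image:
    """Hypothetical perturbation: uniform brightness shift, clipped at 1.0."""
    return [min(1.0, p + delta) for p in img]


data: Labeled = [
    ([0.9, 0.8], 1),
    ([0.45, 0.45], 0),  # close to the decision boundary; brightening flips it
    ([0.1, 0.2], 0),
    ([0.6, 0.7], 1),
]
clean = accuracy(toy_classifier, data)
robust = robust_accuracy(toy_classifier, data, brighten)
print(clean, robust)  # prints: 1.0 0.75
```

Running both functions separately on the real-world and AI-generated subsets would expose a robustness gap between the two image types.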

Innovation

Methods, ideas, or system contributions that make the work stand out.

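UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers on 10K annotated real-world and AI-generated images

PerspectiveVision, a moderation tool that improves effectiveness and robustness over existing classifiers, especially on AI-generated images

Focus on the distribution shift between real-world and AI-generated images in quality, style, and layout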

🔎 Similar Papers

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?