SafetyPairs: Isolating Safety Critical Image Features with Counterfactual Image Generation

πŸ“… 2025-10-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific visual features that drive safety decisions. To address this, the paper proposes SafetyPairs, a scalable counterfactual image generation framework: leveraging controllable image editing models, it selectively modifies safety-relevant attributes while preserving all safety-irrelevant details, constructing image pairs that differ only along the dimensions targeted by a given safety policy. Using this framework, the authors release a fine-grained benchmark of over 3,020 SafetyPair images spanning nine safety categories. The pairs improve the sample efficiency of training lightweight guard models and expose systematic weaknesses in vision-language models' ability to distinguish subtly different safety-sensitive images.

πŸ“ Abstract
What exactly makes a particular image unsafe? Systematically differentiating between benign and problematic images is a challenging problem, as subtle changes to an image, such as an insulting gesture or symbol, can drastically alter its safety implications. However, existing image safety datasets are coarse and ambiguous, offering only broad safety labels without isolating the specific features that drive these differences. We introduce SafetyPairs, a scalable framework for generating counterfactual pairs of images that differ only in the features relevant to the given safety policy, thus flipping their safety label. By leveraging image editing models, we make targeted changes to images that alter their safety labels while leaving safety-irrelevant details unchanged. Using SafetyPairs, we construct a new safety benchmark, which serves as a powerful source of evaluation data that highlights weaknesses in vision-language models' abilities to distinguish between subtly different images. Beyond evaluation, we find our pipeline serves as an effective data augmentation strategy that improves the sample efficiency of training lightweight guard models. We release a benchmark containing over 3,020 SafetyPair images spanning a diverse taxonomy of 9 safety categories, providing the first systematic resource for studying fine-grained image safety distinctions.
Problem

Research questions and friction points this paper is trying to address.

Isolating safety-critical image features through counterfactual generation
Addressing coarse safety labels in existing image datasets
Improving vision-language models' fine-grained safety distinction capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates counterfactual image pairs via editing
Isolates safety-critical features through targeted modifications
Creates benchmark for fine-grained safety evaluation
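
The pair-generation loop described above could be sketched as follows. This is a minimal illustration, not the authors' implementation: `edit_model` and `guard` are hypothetical callables standing in for the controllable image editor and the safety classifier, and string stubs stand in for real images.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SafetyPair:
    """A counterfactual pair differing only in a safety-relevant feature."""
    original: str
    counterfactual: str
    category: str
    original_label: str
    counterfactual_label: str

def generate_safety_pair(
    image: str,
    category: str,
    edit_model: Callable[[str, str], str],
    guard: Callable[[str], str],
) -> Optional[SafetyPair]:
    """Apply a targeted edit for one safety category and keep the pair
    only if the safety label actually flips (safety-irrelevant edits
    or failed edits are discarded)."""
    before = guard(image)
    edited = edit_model(image, category)
    after = guard(edited)
    if after == before:
        return None  # edit did not flip the label; reject
    return SafetyPair(image, edited, category, before, after)

# Stub "models" for illustration only: the edit appends a marker for the
# targeted safety feature, and the guard flags any image containing it.
def stub_edit(image: str, category: str) -> str:
    return image + f" [+{category}]"

def stub_guard(image: str) -> str:
    return "unsafe" if "[+" in image else "safe"

pair = generate_safety_pair(
    "photo of a crowd", "insulting_gesture", stub_edit, stub_guard
)
print(pair.original_label, "->", pair.counterfactual_label)  # safe -> unsafe
```

In a real pipeline the stubs would be replaced by an instruction-guided image editor and a vision-language guard model, with the label-flip check serving as the quality filter that keeps pairs differing exclusively along the targeted safety dimension.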