OpenSDI: Spotting Diffusion-Generated Images in the Open World

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of detecting and localizing diffusion-generated images in open-world settings, this paper introduces OpenSDID—the first benchmark dataset supporting both global and local manipulation identification—and proposes the Synergizing Pretrained Models (SPM) scheme. Built on SPM, the paper designs MaskCLIP, the first model to jointly leverage CLIP’s cross-modal alignment capability and MAE’s structured reconstruction capacity, augmented with prompt-driven guidance and cross-modal attention coordination to improve cross-domain generalization. On OpenSDID, MaskCLIP achieves state-of-the-art performance, with relative gains over the second-best method of 2.05% in accuracy and 2.38% in F1-score for detection, and 14.23% in IoU and 14.11% in F1-score for localization.

📝 Abstract
This paper identifies OpenSDI, a challenge for spotting diffusion-generated images in open-world settings. In response to this challenge, we define a new benchmark, the OpenSDI dataset (OpenSDID), which stands out from existing datasets due to its diverse use of large vision-language models that simulate open-world diffusion-based manipulations. Another outstanding feature of OpenSDID is its inclusion of both detection and localization tasks for images manipulated globally and locally by diffusion models. To address the OpenSDI challenge, we propose a Synergizing Pretrained Models (SPM) scheme to build up a mixture of foundation models. This approach exploits a collaboration mechanism with multiple pretrained foundation models to enhance generalization in the OpenSDI context, moving beyond traditional training by synergizing multiple pretrained models through prompting and attending strategies. Building on this scheme, we introduce MaskCLIP, an SPM-based model that aligns Contrastive Language-Image Pre-Training (CLIP) with Masked Autoencoder (MAE). Extensive evaluations on OpenSDID show that MaskCLIP significantly outperforms current state-of-the-art methods for the OpenSDI challenge, achieving remarkable relative improvements of 14.23% in IoU (14.11% in F1) and 2.05% in accuracy (2.38% in F1) compared to the second-best model in localization and detection tasks, respectively. Our dataset and code are available at https://github.com/iamwangyabin/OpenSDI.
Problem

Research questions and friction points this paper is trying to address.

Detecting diffusion-generated images in open-world settings
Localizing globally and locally manipulated diffusion images
Improving generalization using multiple pretrained foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synergizing Pretrained Models (SPM) scheme
MaskCLIP aligns CLIP with MAE
Collaboration mechanism with foundation models
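To make the collaboration idea concrete, the sketch below shows one plausible form of the cross-modal attention that could fuse CLIP and MAE token features: CLIP patch tokens act as queries and MAE tokens as keys/values in scaled dot-product attention. This is an illustrative reconstruction, not the authors' implementation; the projection matrices, feature dimensions, and token counts are hypothetical (random weights stand in for learned ones).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(clip_feats, mae_feats, d_k=64, seed=0):
    """Fuse CLIP semantic tokens (queries) with MAE reconstruction
    tokens (keys/values) via scaled dot-product cross-attention.
    Projections are random placeholders for learned weights."""
    rng = np.random.default_rng(seed)
    d_clip, d_mae = clip_feats.shape[-1], mae_feats.shape[-1]
    W_q = rng.standard_normal((d_clip, d_k)) / np.sqrt(d_clip)
    W_k = rng.standard_normal((d_mae, d_k)) / np.sqrt(d_mae)
    W_v = rng.standard_normal((d_mae, d_k)) / np.sqrt(d_mae)
    Q, K, V = clip_feats @ W_q, mae_feats @ W_k, mae_feats @ W_v
    attn = softmax(Q @ K.T / np.sqrt(d_k))  # (n_clip, n_mae) weights
    return attn @ V  # one fused token per CLIP patch

# Toy token grids: 196 CLIP patches (768-d), 196 MAE patches (1024-d).
clip_tokens = np.random.default_rng(1).standard_normal((196, 768))
mae_tokens = np.random.default_rng(2).standard_normal((196, 1024))
fused = cross_modal_attention(clip_tokens, mae_tokens)
print(fused.shape)  # (196, 64)
```

In a full model, the fused tokens would feed a localization head (per-patch mask prediction) and a pooled detection head; here the point is only how attention lets one pretrained model's features condition another's.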
Yabin Wang
Xi’an Jiaotong University
Zhiwu Huang
University of Southampton
Computer Vision · Machine Learning · Generative Artificial Intelligence · Geometric Deep Learning
Xiaopeng Hong
Harbin Institute of Technology