🤖 AI Summary
Existing image editing detection methods suffer from coarse-grained localization, reliance on costly pixel-level annotations, and a lack of high-quality benchmark datasets. To address these limitations, this work introduces FragFake, the first large-scale, fine-grained benchmark dataset designed for local editing detection, and pioneers the use of vision-language models (VLMs) in this task by reformulating editing detection as a vision-language understanding problem. A fully automated image editing synthesis pipeline generates diverse, multi-scale edits with precise region-level annotations. Fine-tuned VLMs, including BLIP-2 and Qwen-VL, achieve significant improvements in Object Precision over pretrained baselines, and ablation studies together with cross-scenario transfer experiments demonstrate the robustness and generalizability of the approach. This work establishes a new paradigm, contributes a novel high-quality dataset, and introduces effective VLM-based models for fine-grained editing detection.
📝 Abstract
Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) binary classifiers yield only a global real-or-fake label without providing localization; (2) traditional computer vision methods often rely on costly pixel-level annotations; and (3) no large-scale, high-quality dataset exists for modern image-editing detection. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, comprising high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time for edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will provide a solid foundation to facilitate and inspire subsequent research in multimodal content authenticity.
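The reformulation described above, replacing a binary classifier with a vision-language query, can be illustrated with a minimal sketch. The prompt wording and the `edited=...; object=...; region=...` response format below are illustrative assumptions for this sketch, not the paper's actual interface or training templates.

```python
# Sketch: casting localized-edit detection as vision-language QA.
# The prompt template and structured answer format are hypothetical;
# FragFake's actual instruction/response schema may differ.

EDIT_DETECTION_PROMPT = (
    "Has this image been locally edited? "
    "If so, name the edited object and describe its region. "
    "Answer as: edited=<yes/no>; object=<name>; region=<description>"
)

def parse_vlm_answer(answer: str) -> dict:
    """Parse a structured VLM reply into a prediction dict."""
    fields = {}
    for part in answer.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return {
        "edited": fields.get("edited", "no").lower() == "yes",
        "object": fields.get("object"),
        "region": fields.get("region"),
    }

# A reply that a fine-tuned VLM might produce for an edited image:
reply = "edited=yes; object=wristwatch; region=lower left, on the arm"
pred = parse_vlm_answer(reply)
print(pred["edited"], pred["object"])  # → True wristwatch
```

Keeping the answer machine-parseable is what lets a VLM serve both subtasks at once: the `edited` field gives the classification label, while `object` and `region` carry the localization, without any pixel-level mask supervision.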