FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models

📅 2025-05-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing image editing detection methods suffer from coarse-grained localization, reliance on costly pixel-level annotations, and a lack of high-quality benchmark datasets. To address these limitations, this work introduces FragFake—the first large-scale, fine-grained benchmark dataset specifically designed for local editing detection—and pioneers the integration of vision-language models (VLMs) into this task by reformulating editing detection as a vision-language understanding problem. We propose a fully automated image editing synthesis pipeline capable of generating diverse, multi-scale edits with precise region-level annotations. Leveraging fine-tuned VLMs—including BLIP-2 and Qwen-VL—we achieve significant improvements in Object Precision over pre-trained baselines. Ablation studies and cross-scenario transfer experiments demonstrate the robustness and generalizability of our approach. This work establishes a new paradigm, provides a novel high-quality dataset, and introduces effective VLM-based models for fine-grained editing detection.

📝 Abstract
Fine-grained detection of localized edits in images is crucial for assessing content authenticity, especially given that modern diffusion models and image editing methods can produce highly realistic manipulations. However, this domain faces three challenges: (1) binary classifiers yield only a global real-or-fake label without providing localization; (2) traditional computer vision methods often rely on costly pixel-level annotations; and (3) no large-scale, high-quality dataset exists for modern image-editing detection techniques. To address these gaps, we develop an automated data-generation pipeline to create FragFake, the first dedicated benchmark dataset for edited image detection, which includes high-quality images from diverse editing models and a wide variety of edited objects. Based on FragFake, we utilize Vision Language Models (VLMs) for the first time in the task of edited image classification and edited region localization. Experimental results show that fine-tuned VLMs achieve higher average Object Precision across all datasets, significantly outperforming pretrained models. We further conduct ablation and transferability analyses to evaluate the detectors across various configurations and editing scenarios. To the best of our knowledge, this work is the first to reformulate localized image edit detection as a vision-language understanding task, establishing a new paradigm for the field. We anticipate that this work will establish a solid foundation to facilitate and inspire subsequent research in the domain of multimodal content authenticity.
Problem

Research questions and friction points this paper is trying to address.

Detect localized image edits for content authenticity assessment
Overcome lack of large-scale datasets for modern editing detection
Utilize Vision Language Models for edit classification and localization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline creates FragFake benchmark dataset
Vision Language Models for edit detection
Reformulates detection as vision-language task
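The reformulation of edit detection as a vision-language task can be illustrated with a minimal sketch: each image becomes an instruction-tuning sample that pairs a detection/localization question with a textual answer naming the edited object and its region. All field names, the prompt wording, and the bounding-box convention below are illustrative assumptions, not the paper's actual schema.

```python
# Hedged sketch: posing localized edit detection as a vision-language
# understanding problem. Field names and the (x1, y1, x2, y2) pixel
# bounding-box convention are illustrative assumptions.

def make_vlm_sample(image_path, edited_object=None, bbox=None):
    """Build one instruction-tuning sample for an edit-detection VLM.

    edited_object/bbox of None means the image is unedited (a negative sample).
    """
    question = ("Has any object in this image been edited? "
                "If so, which one, and where?")
    if edited_object is None:
        answer = "No, this image shows no signs of local editing."
    else:
        x1, y1, x2, y2 = bbox
        answer = (f"Yes, the {edited_object} has been edited. "
                  f"It is located at region ({x1}, {y1}, {x2}, {y2}).")
    return {
        "image": image_path,
        "conversations": [
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ],
    }

# Example usage: one positive and one negative sample.
pos = make_vlm_sample("edited_001.jpg", edited_object="dog",
                      bbox=(40, 60, 220, 310))
neg = make_vlm_sample("original_001.jpg")
print(pos["conversations"][1]["content"])
print(neg["conversations"][1]["content"])
```

Framing the label as free-form text is what lets a fine-tuned VLM (e.g. BLIP-2 or Qwen-VL) answer with both a classification and a localization in a single generation, instead of requiring pixel-level masks.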
Zhen Sun
DSA Thrust, HKUST(GZ)
LLM security
Ziyi Zhang
Hong Kong University of Science and Technology (Guangzhou)
Zeren Luo
DSA Thrust, The Hong Kong University of Science and Technology (Guangzhou)
Zeyang Sha
Ant Group
Computer science and security
Tianshuo Cong
Tsinghua Shuimu Postdoctoral Scholar
Cryptography, Deep learning, Computer security
Zheng Li
Shandong University
Shiwen Cui
Ant Group
Weiqiang Wang
Ant Group
Jiaheng Wei
Hong Kong University of Science and Technology (Guangzhou)
Xinlei He
Assistant Professor, HKUST(GZ)
Trustworthy Machine Learning, Security, Privacy
Qi Li
Tsinghua University
Qian Wang
Wuhan University