Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

📅 2025-09-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI-generated image detection models lack rigorous evaluation of generalization and robustness under real-world conditions. Method: We introduce RRDataset, the first benchmark explicitly designed for these realistic challenges, encompassing seven representative content scenarios, multi-round social media propagation, and diverse re-digitization operations (e.g., compression, format conversion, screen recapture). We propose the first comprehensive evaluation framework jointly assessing scenario generalization, internet transmission robustness, and re-digitization robustness, systematically evaluating 17 detectors and 10 vision-language models. Contribution/Results: Experiments reveal substantial performance degradation of state-of-the-art methods along authentic dissemination chains, whereas humans demonstrate superior adaptability under few-shot conditions. Our large-scale human-machine comparative study quantifies critical technical bottlenecks and provides an empirical foundation for designing robust detection algorithms and human-AI collaborative mechanisms.

📝 Abstract
With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and Everyday Life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.
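
The transmission and re-digitization dimensions can be illustrated with a small simulation. The sketch below is not the RRDataset pipeline: it assumes Pillow is installed, models multi-round social media sharing as repeated JPEG re-encoding at a fixed quality, and treats the detector as a hypothetical callable returning the probability that an image is AI-generated.

```python
# Minimal sketch (not the authors' pipeline): approximate "internet transmission"
# by repeatedly re-encoding an image as JPEG, then measure how much a detector's
# score changes. The `detector` interface and quality setting are assumptions.
import io
from PIL import Image

def simulate_transmission(img: Image.Image, rounds: int = 3, quality: int = 75) -> Image.Image:
    """Apply several rounds of JPEG re-encoding, mimicking repeated re-uploads."""
    for _ in range(rounds):
        buf = io.BytesIO()
        img.convert("RGB").save(buf, format="JPEG", quality=quality)
        buf.seek(0)
        img = Image.open(buf)
        img.load()  # force decode before the buffer is replaced next round
    return img

def robustness_drop(detector, path: str) -> float:
    """Change in detector score (P(AI-generated)) after simulated transmission."""
    original = Image.open(path).convert("RGB")
    degraded = simulate_transmission(original)
    return detector(original) - detector(degraded)
```

A large score drop on AI-generated images would indicate the kind of transmission fragility the benchmark is designed to expose; real platforms also resize, strip metadata, and recompress with their own codecs, which this sketch does not model.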
Problem

Research questions and friction points this paper is trying to address.

Benchmarking AI-generated image detection in real-world scenarios
Evaluating detector robustness under internet transmission distortions
Assessing model performance across diverse content domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-World Robustness Dataset (RRDataset) spanning seven content scenarios
Joint evaluation of scenario generalization, internet transmission robustness, and re-digitization robustness
Benchmarking of 17 detectors and 10 vision-language models (see the sketch below)
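
A scenario-generalization benchmark of this kind reduces to a per-detector, per-split accuracy table. The loop below is a hypothetical harness, not the paper's evaluation code; the dataset layout, split names, and detector interface are assumptions.

```python
# Hypothetical evaluation harness: accuracy for each detector on each scenario split.
from typing import Callable, Dict, Iterable, Tuple

Sample = Tuple[str, int]  # (image_path, label) where 1 = AI-generated, 0 = real

def evaluate(detectors: Dict[str, Callable[[str], float]],
             splits: Dict[str, Iterable[Sample]],
             threshold: float = 0.5) -> Dict[str, Dict[str, float]]:
    """Return accuracy[detector_name][split_name] over all (path, label) samples."""
    results: Dict[str, Dict[str, float]] = {}
    for name, detect in detectors.items():
        results[name] = {}
        for split_name, samples in splits.items():
            sample_list = list(samples)
            correct = sum(
                int((detect(path) >= threshold) == bool(label))
                for path, label in sample_list
            )
            results[name][split_name] = correct / max(len(sample_list), 1)
    return results
```

Comparing each detector's accuracy on the clean split against its accuracy on the transmitted and re-digitized splits yields the degradation figures the benchmark is meant to surface.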
Chunxiao Li
University of Science and Technology of China
Xiaoxiao Wang
University of Science and Technology of China
Meiling Li
Fudan University, Shanghai, China
Boming Miao
Beijing Normal University, Beijing, China
Peng Sun
Central University of Finance and Economics, Beijing, China
Yunjian Zhang
Tsinghua University, Beijing, China
Xiangyang Ji
Tsinghua University, Beijing, China
Yao Zhu
Tsinghua University, Beijing, China