Fake It Until You Break It: On the Adversarial Robustness of AI-generated Image Detectors

📅 2024-10-02
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
📄 PDF
🤖 AI Summary
This study addresses the insufficient adversarial robustness of AI-generated image detectors in real-world settings, where they are vulnerable to black-box attacks and common social media degradations (e.g., JPEG compression, resizing, color distortion), enabling malicious misuse for disinformation and undermining democratic trust. We conduct the first systematic evaluation of mainstream detectors under combined black-box adversarial perturbations and realistic degradations, revealing that state-of-the-art models suffer over 40% accuracy degradation without model access. To mitigate this, we propose a lightweight CLIP-enhanced defense grounded in zero-shot detection and black-box transfer attack modeling—requiring no retraining or fine-tuning. Our method preserves original detection performance while reducing adversarial success rates by 76%, substantially restoring practical utility and robustness on real platforms. This work delivers a deployable, trustworthy solution for AI-generated content authentication.

Technology Category

Application Category

📝 Abstract
While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content. However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios. We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms. These attacks can significantly reduce detection accuracy to the extent that the risks of relying on detectors outweigh their benefits. Finally, we propose a simple defense mechanism to make CLIP-based detectors, which are currently the best-performing detectors, robust against these attacks.
Problem

Research questions and friction points this paper is trying to address.

AI-generated image detectors lack robustness against adversarial attacks
Current classifiers fail under real-world conditions and image degradation
Commercial GenAI detection tools are vulnerable to black-box attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Demonstrates vulnerability of AI image detectors
Tests four detectors with five attack algorithms
Evaluates robustness using pre-trained model features
🔎 Similar Papers
No similar papers found.
S
Sina Mavali
CISPA Helmholtz Center for Information Security
J
Jonas Ricker
Ruhr University Bochum
D
David Pape
CISPA Helmholtz Center for Information Security
Y
Yash Sharma
University of Tübingen
Asja Fischer
Asja Fischer
Professor for Machine Learning, Ruhr University Bochum
machine learningdeep learningprobabilistic models
Lea Schönherr
Lea Schönherr
CISPA Helmholtz Center for Information Security
Trustworthy MLTrustworthy Generative AIComputer Security