🤖 AI Summary
This work systematically investigates the vulnerability of reconstruction-based detectors of diffusion-generated images to imperceptible adversarial perturbations. The evaluation covers three representative detectors across four generative backbone models. In white-box attacks, detection accuracy drops to near zero, and attacks crafted against one detector transfer to others, which makes black-box attacks feasible as well. Standard adversarial defenses offer only limited protection, a failure the authors attribute to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. These findings expose fundamental security limitations of current reconstruction-based detection paradigms.
📝 Abstract
Recently, detecting AI-generated images produced by diffusion-based models has attracted increasing attention because such images pose a potential threat to safety. Among existing approaches, reconstruction-based methods have emerged as a prominent paradigm for this task. However, we find that such methods are severely vulnerable to adversarial perturbations: adding imperceptible adversarial perturbations to input images causes the detection accuracy of these classifiers to collapse to near zero. To verify this threat, we present a systematic evaluation of the adversarial robustness of three representative detectors across four diverse generative backbone models. First, we construct adversarial attacks in white-box scenarios, which degrade the performance of all well-trained detectors. Moreover, we find that these attacks are transferable: attacks crafted against one detector carry over to others, indicating that adversarial attacks on detectors can also be constructed in a black-box setting. Finally, we assess common countermeasures and find that standard defenses against adversarial attacks provide limited mitigation. We attribute these failures to the low signal-to-noise ratio (SNR) of attacked samples as perceived by the detectors. Overall, our results reveal fundamental security limitations of reconstruction-based detectors and highlight the need to rethink existing detection strategies.
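
To make the white-box threat model concrete, the following is a minimal sketch of a bounded-perturbation attack of the kind the abstract describes. It is not the paper's method: the abstract does not name the attack algorithm, so an L-infinity projected gradient descent (PGD) attack is assumed here, and the "detector" is a hypothetical toy logistic model standing in for a real reconstruction-based detector. All names (`detector_score`, `pgd_attack`), the weights, and the budget `eps = 8/255` are illustrative assumptions.

```python
import math

def detector_score(x, w, b):
    """Toy stand-in detector (assumption): probability that x is AI-generated."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-logit))

def score_gradient(x, w, b):
    """Analytic gradient of the score w.r.t. the input (white-box access)."""
    s = detector_score(x, w, b)
    return [s * (1.0 - s) * wi for wi in w]

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def pgd_attack(x, w, b, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD that pushes the detector score toward 'real'."""
    x_adv = list(x)
    for _ in range(steps):
        grad = score_gradient(x_adv, w, b)
        # Step against the gradient to lower the 'generated' score.
        x_adv = [xa - alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # Project back into the eps-ball around the original input.
        x_adv = [min(max(xa, xo - eps), xo + eps) for xa, xo in zip(x_adv, x)]
        # Keep pixel values in the valid [0, 1] range.
        x_adv = [min(max(xa, 0.0), 1.0) for xa in x_adv]
    return x_adv

# Hypothetical 4-pixel "image" and detector weights, for illustration only.
x = [0.6, 0.4, 0.7, 0.5]
w = [12.0, -9.0, 15.0, -11.0]
b = -8.0
x_adv = pgd_attack(x, w, b)
```

In this toy setting the clean input is scored as generated (score above 0.5), while the perturbed input, which differs from it by at most `eps` per pixel, is scored as real. A real reconstruction-based detector would require differentiating through its reconstruction pipeline rather than using a closed-form gradient.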