🤖 AI Summary
Detecting diffusion-generated images remains challenging, especially for unseen generative models not encountered during training. Method: This paper proposes a robust detection framework grounded in universal noise characteristics inherent to diffusion processes. Its core innovations are the Noise-Aware Self-Attention (NASA) module and the NASA-Swin detection architecture: NASA introduces noise-guided anomalous attention to enable cross-modal fusion of RGB and noise-domain features, while a channel-wise masking strategy enhances discriminability. Crucially, the method requires no model-specific priors; only the residual noise naturally preserved in diffusion outputs serves as the universal detection cue. Contribution/Results: The approach achieves state-of-the-art performance on cross-model generalization benchmarks, significantly improving detection accuracy and robustness against images from unknown diffusion models. Experimental validation confirms that residual noise constitutes a viable and effective universal signal for generalized synthetic image detection.
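The "noise-guided anomalous attention" idea can be illustrated with a minimal sketch: standard scaled dot-product self-attention whose logits are additively biased by a per-token noise score, so tokens in noisy regions receive more attention. This is a hedged illustration only; the function name `noise_guided_attention`, the additive-bias formulation, and all shapes are assumptions, not the paper's actual NASA module.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def noise_guided_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                           noise_score: np.ndarray) -> np.ndarray:
    """Self-attention over n tokens of dimension d, with attention logits
    additively biased by a per-token noise score (hypothetical formulation
    of 'noise-guided' attention; the paper's NASA module may differ).

    q, k, v: (n, d) arrays; noise_score: (n,) array of noisiness per token.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)           # (n, n) standard attention logits
    logits = logits + noise_score[None, :]  # bias attention toward noisy tokens
    return softmax(logits, axis=-1) @ v     # (n, d) attended output
```

With a strongly positive noise score on one token, every query's attention collapses onto that token, which is the intended "focus on noise regions" behavior in miniature.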
📄 Abstract
With the rapid development of image generation technologies, especially the advancement of Diffusion Models, the quality of synthesized images has significantly improved, raising concerns among researchers about information security. To mitigate the malicious abuse of diffusion models, diffusion-generated image detection has proven to be an effective countermeasure. However, a key challenge for forgery detection is generalising to diffusion models not seen during training. In this paper, we address this problem by focusing on image noise. We observe that images from different diffusion models share similar noise patterns, distinct from genuine images. Building upon this insight, we introduce a novel Noise-Aware Self-Attention (NASA) module that focuses on noise regions to capture anomalous patterns. To implement a SOTA detection model, we incorporate NASA into Swin Transformer, forming a novel detection architecture, NASA-Swin. Additionally, we employ a cross-modality fusion embedding to combine RGB and noise images, along with a channel mask strategy to enhance feature learning from both modalities. Extensive experiments demonstrate the effectiveness of our approach in enhancing detection capabilities for diffusion-generated images. When encountering unseen generation methods, our approach achieves state-of-the-art performance.
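The cross-modality pipeline the abstract describes (extract a noise image, embed it alongside the RGB image, and randomly mask one modality's channels during training) can be sketched as follows. This is a minimal illustration under assumed details, not the paper's implementation: the noise image is approximated here by subtracting a box-blurred copy, and the masking probabilities are hypothetical choices.

```python
import numpy as np

def noise_residual(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Approximate the noise image as the input minus a k-by-k box-blurred
    copy. (Assumption: the paper's actual noise extractor may differ.)"""
    h, w, _ = img.shape
    pad = k // 2
    padded = np.pad(img.astype(np.float64),
                    ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            blurred += padded[dy:dy + h, dx:dx + w]
    blurred /= k * k
    return img.astype(np.float64) - blurred

def fuse_with_channel_mask(rgb: np.ndarray, noise: np.ndarray,
                           p_mask: float = 0.5, rng=None) -> np.ndarray:
    """Concatenate RGB and noise channels, then (with probability p_mask)
    zero out one modality's channels so the detector must learn from the
    other. Hypothetical version of the paper's channel mask strategy."""
    rng = rng or np.random.default_rng()
    fused = np.concatenate([rgb.astype(np.float64), noise], axis=-1)
    if rng.random() < p_mask:
        half = rgb.shape[-1]
        if rng.random() < 0.5:
            fused[..., :half] = 0.0  # mask the RGB channels
        else:
            fused[..., half:] = 0.0  # mask the noise channels
    return fused
```

In training, the fused tensor would be fed to the backbone (NASA-Swin in the paper); the random masking forces the network to extract discriminative cues from each modality independently rather than relying on one alone.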