🤖 AI Summary
This study addresses the growing threat of disinformation and identity theft posed by increasingly photorealistic AI-generated images by providing the first systematic evaluation of Vision Mamba architectures for detecting such synthetic content. It benchmarks several Vision Mamba variants on multiple generated-image datasets against established baselines, including CNNs, Vision Transformers, and vision-language models, and compares accuracy, computational efficiency, and generalization. The results show that Vision Mamba is promising for detection, particularly in computational efficiency, but still lags behind in cross-model generalization and detection accuracy. These findings offer guidance for designing efficient and robust systems that identify forged imagery.
📝 Abstract
In recent years, computer vision has progressed rapidly, driven by architectures such as Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), diffusion-based architectures, Vision Transformers (ViTs), and, more recently, Vision-Language Models (VLMs). This progress has made generated visual content increasingly realistic and diverse, which also raises concerns about misuse in areas such as misinformation, identity theft, and threats to privacy and security. In parallel, Mamba-based architectures have emerged as versatile tools for a range of image analysis tasks, including classification, segmentation, medical imaging, object detection, and image restoration. However, their potential for identifying AI-generated images remains relatively unexplored compared to established techniques. This study provides a systematic evaluation and comparative analysis of Vision Mamba models for AI-generated image detection. We benchmark multiple Vision Mamba variants against representative CNNs, ViTs, and VLM-based detectors across multiple datasets and synthetic image sources, focusing on accuracy, efficiency, and generalizability across image types and generative models. Through this analysis, we aim to elucidate Vision Mamba's strengths and limitations relative to established methods. Overall, our findings highlight both the promise and the current limitations of Vision Mamba as a component in systems designed to distinguish authentic from AI-generated visual content, a distinction that is becoming a major challenge.