🤖 AI Summary
Current AI-generated image detectors generalize poorly when applied to frames extracted from videos. To address this limitation, this work proposes VINA (Video as Natural Augmentation), a framework that treats video frames as physically grounded natural augmentations of images, enabling a unified image-video AIGC detection model. Through joint training and a cross-modal supervised contrastive objective, VINA aligns image and video feature representations under a shared real/fake decision boundary. Across 14 image, video, and in-the-wild benchmarks, the approach achieves state-of-the-art performance in nearly all evaluated settings without complex data augmentation or dataset-specific tuning, and it improves robustness and transferability with gains on both image and video detection.
📝 Abstract
AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: state-of-the-art AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from two entangled sources: synthesis-agnostic video processing shifts (color conversion, codec compression, resizing, and blur) and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
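To make the cross-modal supervised contrastive idea concrete, the following is a minimal sketch of a generic supervised contrastive loss computed over a mixed batch of image and video-frame embeddings. It is an illustration under stated assumptions, not VINA's actual objective: the abstract does not give the loss formula, so the function names (`l2_normalize`, `supcon_loss`), the temperature value, and the choice to define positives purely by the real/fake label are all hypothetical. The point it illustrates is that when modality is ignored in the positive set, a real image and a real video frame are pulled together, which is one way to obtain a shared real/fake decision boundary.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one mixed image/video batch.

    embeddings: list of non-zero feature vectors (images and video frames together).
    labels: 0 for real, 1 for fake. Modality is deliberately not used (an assumption
            of this sketch), so positives include cross-modal pairs.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no same-label partner contributes nothing
        # Temperature-scaled similarity of anchor i to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of picking a positive for this anchor.
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


# Toy batch: samples 0 and 1 lie close together, as do samples 2 and 3.
# Read them as (real image, real video frame, fake image, fake video frame).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_aligned = supcon_loss(features, labels=[0, 0, 1, 1])
loss_misaligned = supcon_loss(features, labels=[0, 1, 0, 1])
```

In this toy batch the loss is lower when the labels agree with the geometry of the features than when they cut across it, which is the behavior such an objective would reward during joint training. In practice it would be combined with a standard real/fake classification loss; how VINA weights or combines the terms is not stated in the abstract.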