Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current AI-generated image detectors often collapse when applied to frames extracted from videos. To address this, the work proposes VINA (Video as Natural Augmentation), a unified AIGC detection framework that trains jointly on image and video data and treats video frames as physically grounded natural augmentations of images. A cross-modal supervised contrastive objective aligns image and video representations under a shared real/fake decision boundary. On 14 image, video, and in-the-wild benchmarks, VINA achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning, and it delivers bidirectional gains along with improved robustness and transferability.
📝 Abstract
AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
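As a rough illustration of the cross-modal supervised contrastive idea, the sketch below computes a generic SupCon-style loss over a mixed batch of image and video-frame embeddings, where samples sharing a real/fake label count as positives regardless of modality. This is an assumption-laden sketch, not the paper's formulation: the abstract gives neither the exact loss, the temperature, nor the batching scheme, and the function name `supervised_contrastive_loss` and its parameters are hypothetical.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic SupCon-style loss (hypothetical helper, not VINA's exact objective).

    embeddings: (N, D) features from images and video frames mixed in one batch.
    labels:     (N,) class ids, e.g. 0 = real, 1 = fake (modality is ignored).
    """
    z = np.asarray(embeddings, dtype=np.float64)
    y = np.asarray(labels)
    n = len(y)

    # L2-normalize so the dot product is cosine similarity.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    # Exclude each sample's similarity with itself from the softmax.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: same label, different sample (image or video alike).
    pos_mask = (y[:, None] == y[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0  # no anchor has a positive partner in this batch

    # Average log-probability of positives per anchor, then over anchors.
    sum_pos = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(sum_pos[valid] / pos_count[valid]).mean())


# Toy batch: rows 0 and 2 stand in for image features, rows 1 and 3 for
# video-frame features (all values are made up for illustration).
features = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
aligned_loss = supervised_contrastive_loss(features, [0, 0, 1, 1])
misaligned_loss = supervised_contrastive_loss(features, [0, 1, 0, 1])
```

In the toy batch, the loss is lower when image and video features of the same class sit close together than when the labels cut across the clusters, which is the kind of alignment the objective is meant to encourage.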
Problem

Research questions and friction points this paper is trying to address.

AI-generated content detection
cross-modal generalization
video forensics
image and video detection
distributional shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video as Natural Augmentation
cross-modal detection
supervised contrastive learning
AIGC detection
unified framework