Video as Natural Augmentation: Towards Unified AI-Generated Image and Video Detection

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current AI-generated image detectors generalize poorly when applied to frames extracted from videos. To address this, the work proposes VINA (Video as Natural Augmentation), a framework that treats video frames as physically grounded natural augmentations of images and trains a single detector jointly on image and video data. A cross-modal supervised contrastive objective aligns image and video representations under a shared real/fake decision boundary. Evaluated on 14 image, video, and in-the-wild benchmarks, VINA reports state-of-the-art performance in nearly all settings without complex augmentation or dataset-specific tuning, with bidirectional gains across images and videos and improved robustness and transferability.
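To make the idea of a cross-modal supervised contrastive objective concrete, here is a minimal NumPy sketch of the generic supervised contrastive (SupCon) loss applied to a mixed batch of image and video-frame embeddings. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `supcon_loss`, the temperature value, and the choice to define positives purely by the real/fake label (ignoring modality) are all assumptions made here.

```python
import numpy as np


def supcon_loss(features, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative only).

    features: (N, D) embeddings; images and video frames may be mixed freely.
    labels:   (N,) integers, e.g. 0 = real, 1 = fake. Modality is ignored, so
              an image and a video frame with the same label are positives.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = len(labels)

    # L2-normalize so dot products are cosine similarities.
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    not_self = ~np.eye(n, dtype=bool)
    # Positives share the anchor's label; the anchor itself is excluded.
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over all other samples, with the anchor removed
    # from the denominator.
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits) * not_self
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))

    # Average log-probability of positives, skipping anchors with no positive.
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0
    mean_log_prob_pos = (log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return float(-mean_log_prob_pos.mean())
```

In this sketch the loss is small when embeddings with the same real/fake label sit close together regardless of whether they came from an image or a video frame, which is the alignment effect the summary describes. In a joint-training setup it would typically be added to a standard classification loss; the weighting between the two is not specified in the text above.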

📝 Abstract

AI-generated content (AIGC) is rapidly improving, creating an urgent need for detectors that generalize across data sources, deployment pipelines, and visual modalities. A strongly generalizable detector should remain robust under distributional variations. However, we identify a consistent failure mode: SOTA AI-generated image detectors often collapse when applied to frames extracted from videos. Through systematic analysis, we show that this cross-modal gap arises from both entangled synthesis-agnostic video processing shifts, including color conversion, codec compression, resizing, and blur, and model-specific fingerprints introduced by modern video generators. Motivated by these findings, we propose VINA (Video as Natural Augmentation), a unified AIGC detection framework that jointly trains on image and video data. VINA uses video frames as physically grounded natural augmentations and further introduces a cross-modal supervised contrastive objective to align image and video representations under a shared real/fake decision boundary. Extensive experiments on 14 image, video, and in-the-wild benchmarks show that VINA delivers bidirectional gains, improves robustness and transferability, and achieves state-of-the-art performance across nearly all evaluated settings without complex augmentation or dataset-specific tuning.
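The synthesis-agnostic shifts named in the abstract (color conversion, resizing, blur) can be approximated on a single image to see how far a video pipeline moves pixels away from the original. The NumPy sketch below is a rough stand-in, not the paper's analysis protocol: the helper names, the BT.601/JFIF full-range color matrix, the 2x2 chroma subsampling, and the 3x3 box blur are assumptions chosen for simplicity, and real codec compression (quantization, motion compensation) is not modeled.

```python
import numpy as np


def pool2x2(x):
    """Average non-overlapping 2x2 blocks, then repeat back to the input size."""
    h, w = x.shape[:2]
    blocks = x.reshape(h // 2, 2, w // 2, 2, *x.shape[2:]).mean(axis=(1, 3))
    return blocks.repeat(2, axis=0).repeat(2, axis=1)


def box_blur3(x):
    """3x3 box blur with edge padding on an (H, W, C) array."""
    h, w = x.shape[:2]
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)), mode="edge")
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0


def video_like_shift(rgb):
    """Apply hypothetical video-style processing to an RGB image.

    rgb: array of shape (H, W, 3) with values in [0, 255]; H and W must be even.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]

    # Color conversion: RGB -> YCbCr (BT.601 full-range, as used by JFIF).
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b

    # 4:2:0-style chroma subsampling: chroma keeps half resolution per axis.
    cb, cr = pool2x2(cb), pool2x2(cr)

    # Back to RGB.
    out = np.stack(
        [
            y + 1.402 * (cr - 128.0),
            y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0),
            y + 1.772 * (cb - 128.0),
        ],
        axis=-1,
    )

    out = pool2x2(out)    # resizing: downscale by 2, then upscale
    out = box_blur3(out)  # blur
    return np.clip(out, 0.0, 255.0)
```

A flat gray image passes through essentially unchanged, while a textured image is altered noticeably. That kind of pixel-level shift is independent of how the content was generated, which is why the abstract calls it synthesis-agnostic.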

Problem

Research questions and friction points this paper is trying to address.

AI-generated content detection

cross-modal generalization

video forensics

image and video detection

distributional shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

Video as Natural Augmentation

cross-modal detection

supervised contrastive learning

AIGC detection

unified framework

🔎 Similar Papers

Detecting AI-Generated Video via Frame Consistency

2024-02-03 · Citations: 1