Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

📅 2024-12-16
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 6
Influential: 0
🤖 AI Summary
Existing DeepFake detectors overly rely on facial regions, limiting their effectiveness against full-frame manipulations and text-to-video (T2V) or image-to-video (I2V) entirely synthetic videos. To address this, we propose UNITE, a universal video forgery detector that breaks the face-centric paradigm. UNITE establishes the first unified detection framework covering facial manipulation, background editing, and end-to-end generative content. It introduces an Attention Diversity (AD) loss to explicitly suppress facial bias and enhance spatial attention generalization. Leveraging domain-agnostic video features extracted by SigLIP-So400M and a Transformer-based architecture, UNITE jointly optimizes the AD loss and cross-entropy loss using heterogeneous multi-source data. Evaluated across diverse benchmarks—including facial tampering, background editing, and T2V/I2V synthetic videos—UNITE consistently outperforms state-of-the-art methods, demonstrating superior cross-scenario adaptability and generalization performance.

📝 Abstract
Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, with non-human subjects, and with complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given the scarcity of datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining the AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
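The abstract describes a joint objective: standard cross-entropy for real/fake classification plus an attention-diversity (AD) term that discourages attention from collapsing onto facial regions. The paper's exact AD formulation is not given here; the sketch below is a minimal NumPy illustration of one plausible variant, assuming the AD term penalizes low-entropy (spatially concentrated) attention maps. The weighting factor `lam` is an assumed hyperparameter, not a value from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_diversity_loss(attn):
    """Hypothetical AD-style regularizer: penalize attention maps that
    concentrate on a few spatial tokens (e.g. the face) by pushing each
    frame's attention distribution toward maximum entropy.

    attn: array of shape (frames, tokens); rows are non-negative and sum to 1.
    Returns 0 for perfectly uniform attention, larger values as attention peaks.
    """
    eps = 1e-8
    entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # per-frame entropy
    max_entropy = np.log(attn.shape[-1])                 # entropy of uniform map
    return float((max_entropy - entropy).mean())

def cross_entropy(logits, label):
    """Binary real/fake classification loss on a single video's logits."""
    p = softmax(logits)
    return float(-np.log(p[label] + 1e-8))

# Joint objective as described in the abstract: CE plus a weighted AD term.
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(8, 196)))   # e.g. 8 frames, 14x14 spatial tokens
logits = np.array([2.0, -1.0])              # [real, fake] scores
lam = 0.1                                   # assumed weighting, not from the paper
total = cross_entropy(logits, label=0) + lam * attention_diversity_loss(attn)
```

In training, gradients of the AD term would flow back into the transformer's attention weights, spreading spatial focus beyond faces; this sketch only computes the scalar objective.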
Problem

Research questions and friction points this paper is trying to address.

Detecting fully AI-generated synthetic videos beyond facial manipulations
Addressing limitations of face-centric methods in video forensics
Generalizing detection to non-human subjects and background alterations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer architecture processes domain-agnostic video features
Attention-diversity loss prevents over-focusing on facial regions
Integrates task-irrelevant data with standard DeepFake datasets
Rohit Kundu
Google, Mountain View, USA
Hao Xiong
Google, Mountain View, USA
Vishal Mohanty
Google, Mountain View, USA
Athula Balachandran
Google
Amit K. Roy-Chowdhury
Professor and UC Presidential Chair, UC Riverside; Fellow IEEE, IAPR
Computer Vision · Statistical Learning · Image Processing · Camera Networks