What Matters in Detecting AI-Generated Videos like Sora?

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 12
Influential: 1
🤖 AI Summary
This study quantifies the 3D perceptual gap between AI-generated and real-world videos along three orthogonal dimensions of 3D visual understanding: appearance, motion, and geometry. Training 3D-CNN classifiers on videos from Stable Video Diffusion, the authors systematically characterize 3D inconsistencies in generated videos and propose a multi-cue detection framework. Methodologically, they integrate vision-foundation-model features (appearance), optical flow (motion), and monocular depth estimation (geometry) within an "Ensemble-of-Experts" detection paradigm, and apply Grad-CAM to localize systematic failure modes. Experiments show that each unimodal detector alone achieves high performance; critically, the ensemble maintains strong accuracy on Sora videos never seen during training, confirming a stable, transferable 3D discriminative gap between real and synthetic videos.
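A minimal sketch of such a multi-cue ensemble, assuming one small 3D-CNN expert per cue and a simple logit average as the fusion rule; the layer sizes, channel counts, and fusion scheme here are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class Cue3DCNN(nn.Module):
    """Tiny 3D-CNN binary classifier for one cue (hypothetical layer sizes)."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # pool over time and space
        )
        self.head = nn.Linear(16, 1)   # real-vs-fake logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.head(self.features(x).flatten(1))

class EnsembleOfExperts(nn.Module):
    """Averages the logits of appearance, flow, and depth experts."""
    def __init__(self):
        super().__init__()
        self.appearance = Cue3DCNN(in_channels=3)  # RGB or foundation-model features
        self.flow = Cue3DCNN(in_channels=2)        # optical flow (u, v)
        self.depth = Cue3DCNN(in_channels=1)       # monocular depth

    def forward(self, rgb, flow, depth):
        logits = (self.appearance(rgb) + self.flow(flow) + self.depth(depth)) / 3
        return torch.sigmoid(logits)   # probability that the clip is AI-generated

# usage with random stand-in tensors
model = EnsembleOfExperts()
rgb = torch.randn(1, 3, 8, 64, 64)
flow = torch.randn(1, 2, 8, 64, 64)
depth = torch.randn(1, 1, 8, 64, 64)
prob = model(rgb, flow, depth)
```

Keeping the experts fully separate until the final logit average means each cue can be trained and evaluated independently, which matches the paper's finding that each unimodal detector is strong on its own.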

📝 Abstract
Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
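The Grad-CAM analysis mentioned in the abstract weights a conv layer's activations by their gradients to produce a spatio-temporal heatmap of what drives the real-vs-fake decision. A minimal sketch for a 3D conv layer, with illustrative (assumed) shapes and a toy classifier head rather than the authors' network:

```python
import torch
import torch.nn as nn

# Toy stand-ins for a target 3D conv layer and a classifier head.
conv = nn.Conv3d(3, 8, kernel_size=3, padding=1)
head = nn.Linear(8, 1)

clip = torch.randn(1, 3, 8, 32, 32)       # (batch, channels, frames, H, W)
acts = conv(clip)                          # (1, 8, 8, 32, 32)
acts.retain_grad()                         # keep gradients of this non-leaf tensor

logit = head(acts.mean(dim=(2, 3, 4)))     # global-average-pool, then classify
logit.sum().backward()

# Channel weights = gradient averaged over time and space (Grad-CAM weights).
weights = acts.grad.mean(dim=(2, 3, 4), keepdim=True)   # (1, 8, 1, 1, 1)
cam = torch.relu((weights * acts).sum(dim=1))           # (1, frames, H, W) heatmap
cam = cam / (cam.max() + 1e-8)                          # normalize to [0, 1]
```

Each frame of `cam` highlights the regions that most increased the fake-video logit, which is how the paper localizes systematic appearance, motion, and geometry failures.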
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI-generated videos' 3D world simulation capabilities objectively
Assessing 3D visual consistency in synthetic videos without manual annotations
Quantifying the gap between real and AI-generated 3D video quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned 3D evaluation that assesses the 3D simulation quality of AI-generated videos
3D convolutional networks trained on monocular cues (appearance, optical flow, depth)
Quantifies a generalizable gap in 3D coherence between real and synthetic videos