What Matters in Detecting AI-Generated Videos like Sora?

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 12
Influential: 1
🤖 AI Summary
This study quantifies the 3D perceptual gap between AI-generated and real-world videos along three orthogonal dimensions of 3D visual understanding: appearance, motion, and geometry. Training 3D-CNN classifiers on videos from Stable Video Diffusion, the authors systematically characterize 3D inconsistencies in generated videos and propose a multi-cue detection framework. Methodologically, they integrate vision-foundation-model features (appearance), optical flow (motion), and monocular depth estimation (geometry) within an "Ensemble-of-Experts" detection paradigm, and apply Grad-CAM to localize systematic failure modes. Experiments show that each unimodal detector alone achieves high performance; critically, the ensemble maintains strong accuracy on Sora videos never seen during training, confirming a stable, transferable 3D discriminative gap between real and synthetic videos.
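A minimal sketch of such a multi-cue ensemble, assuming one small 3D-CNN expert per cue and a simple logit average as the fusion rule; the layer sizes, channel counts, and fusion scheme here are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class Cue3DCNN(nn.Module):
    """Tiny 3D-CNN binary classifier for one cue (hypothetical layer sizes)."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # pool over time and space
        )
        self.head = nn.Linear(16, 1)   # real-vs-fake logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.head(self.features(x).flatten(1))

class EnsembleOfExperts(nn.Module):
    """Averages the logits of appearance, flow, and depth experts."""
    def __init__(self):
        super().__init__()
        self.appearance = Cue3DCNN(in_channels=3)  # RGB or foundation-model features
        self.flow = Cue3DCNN(in_channels=2)        # optical flow (u, v)
        self.depth = Cue3DCNN(in_channels=1)       # monocular depth

    def forward(self, rgb, flow, depth):
        logits = (self.appearance(rgb) + self.flow(flow) + self.depth(depth)) / 3
        return torch.sigmoid(logits)   # probability that the clip is AI-generated

# usage with random stand-in tensors
model = EnsembleOfExperts()
rgb = torch.randn(1, 3, 8, 64, 64)
flow = torch.randn(1, 2, 8, 64, 64)
depth = torch.randn(1, 1, 8, 64, 64)
prob = model(rgb, flow, depth)
```

Keeping the experts fully separate until the final logit average means each cue can be trained and evaluated independently, which matches the paper's finding that each unimodal detector is strong on its own.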

📝 Abstract
Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
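The Grad-CAM analysis mentioned in the abstract weights a conv layer's activations by their gradients to produce a spatio-temporal heatmap of what drives the real-vs-fake decision. A minimal sketch for a 3D conv layer, with illustrative (assumed) shapes and a toy classifier head rather than the authors' network:

```python
import torch
import torch.nn as nn

# Toy stand-ins for a target 3D conv layer and a classifier head.
conv = nn.Conv3d(3, 8, kernel_size=3, padding=1)
head = nn.Linear(8, 1)

clip = torch.randn(1, 3, 8, 32, 32)       # (batch, channels, frames, H, W)
acts = conv(clip)                          # (1, 8, 8, 32, 32)
acts.retain_grad()                         # keep gradients of this non-leaf tensor

logit = head(acts.mean(dim=(2, 3, 4)))     # global-average-pool, then classify
logit.sum().backward()

# Channel weights = gradient averaged over time and space (Grad-CAM weights).
weights = acts.grad.mean(dim=(2, 3, 4), keepdim=True)   # (1, 8, 1, 1, 1)
cam = torch.relu((weights * acts).sum(dim=1))           # (1, frames, H, W) heatmap
cam = cam / (cam.max() + 1e-8)                          # normalize to [0, 1]
```

Each frame of `cam` highlights the regions that most increased the fake-video logit, which is how the paper localizes systematic appearance, motion, and geometry failures.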
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI-generated videos' 3D world simulation capabilities objectively
Assessing 3D visual consistency in synthetic videos without manual annotations
Quantifying the gap between real and AI-generated 3D video quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learned 3D evaluation that assesses the 3D simulation quality of AI-generated videos
3D convolutional networks trained on monocular cues (appearance, optical flow, depth)
Quantifies a generalizable gap in 3D coherence between real and synthetic videos