Leveraging Pre-Trained Visual Models for AI-Generated Video Detection

📅 2025-07-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing AI-generated video detection methods are largely confined to deepfake face manipulation and struggle to generalize to increasingly realistic text-to-video (T2V) models. Method: We propose a zero-shot/low-shot detection framework leveraging large-scale pretrained vision models (e.g., CLIP), requiring no model fine-tuning. It extracts frame-level visual features, applies temporal pooling, and employs a lightweight linear classifier to capture intrinsic discriminative signals between real and synthetic videos. Contribution/Results: Evaluated on VID-AID, a newly constructed large-scale benchmark comprising over 10,000 videos spanning nine state-of-the-art T2V models, the framework achieves above 90% average detection accuracy. It moves well beyond face-centric detection paradigms, demonstrating strong generalization across diverse generative architectures and prompting modalities. This work provides a scalable, model-agnostic technical foundation for misinformation mitigation, privacy preservation, and multimedia content security.
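The summary notes that pretrained features alone support detection without any model training. The paper does not spell out the training-free decision rule here, so the sketch below uses an assumed prototype-similarity rule: score each video's pooled feature by cosine similarity to a prototype averaged from known-real features. The random feature arrays are hypothetical stand-ins for CLIP embeddings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical stand-in features; in the paper these would come from a
# pretrained vision encoder such as CLIP, pooled over each video's frames.
rng = np.random.default_rng(0)
D = 512
real_feats = rng.normal(loc=0.3, size=(40, D))   # videos known to be real
fake_feats = rng.normal(loc=-0.3, size=(40, D))  # AI-generated videos

# Training-free rule (an assumption, not the paper's exact procedure):
# build a prototype from a handful of real videos, then threshold the
# similarity of each test video to that prototype.
prototype = real_feats[:20].mean(axis=0)

def is_real(feat: np.ndarray, threshold: float = 0.0) -> bool:
    return cosine(feat, prototype) > threshold

real_acc = np.mean([is_real(f) for f in real_feats[20:]])
fake_acc = np.mean([not is_real(f) for f in fake_feats])
```

No classifier weights are learned here; the only "supervision" is a small set of features from videos assumed to be real, which matches the zero-shot/low-shot framing.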

๐Ÿ“ Abstract
Recent advances in Generative AI (GenAI) have led to significant improvements in the quality of generated visual content. As AI-generated visual content becomes increasingly indistinguishable from real content, the challenge of detecting the generated content becomes critical in combating misinformation, ensuring privacy, and preventing security threats. Although there has been substantial progress in detecting AI-generated images, current methods for video detection are largely focused on deepfakes, which primarily involve human faces. However, the field of video generation has advanced beyond deepfakes, creating an urgent need for methods capable of detecting AI-generated videos with generic content. To address this gap, we propose a novel approach that leverages pre-trained visual models to distinguish between real and generated videos. The features extracted from these pre-trained models, which have been trained on extensive real visual content, contain inherent signals that can help distinguish real from generated videos. Using these extracted features, we achieve high detection performance without requiring additional model training, and we further improve performance by training a simple linear classification layer on top of the extracted features. We validated our method on a dataset we compiled (VID-AID), which includes around 10,000 AI-generated videos produced by 9 different text-to-video models, along with 4,000 real videos, totaling over 7 hours of video content. Our evaluation shows that our approach achieves high detection accuracy, above 90% on average, underscoring its effectiveness. Upon acceptance, we plan to publicly release the code, the pre-trained models, and our dataset to support ongoing research in this critical area.
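The abstract's trained variant (frame-level features, temporal pooling, then a simple linear classification layer) can be sketched as follows. Since the paper's code is not yet released, this is a minimal reconstruction under stated assumptions: mean pooling over frames, logistic regression as the linear layer, and random arrays standing in for the pretrained encoder's per-frame features.

```python
import numpy as np

def temporal_pool(frame_features: np.ndarray) -> np.ndarray:
    """Average frame-level features of shape (T, D) into one (D,) video vector."""
    return frame_features.mean(axis=0)

class LinearDetector:
    """Plain logistic regression, standing in for the paper's
    'simple linear classification layer' on top of frozen features."""
    def __init__(self, dim: int):
        rng = np.random.default_rng(0)
        self.w = rng.normal(scale=0.01, size=dim)
        self.b = 0.0

    def fit(self, X: np.ndarray, y: np.ndarray, lr: float = 0.1, epochs: int = 200):
        y = np.asarray(y, dtype=float)
        for _ in range(epochs):
            p = 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))  # sigmoid scores
            grad = p - y                                      # cross-entropy gradient
            self.w -= lr * (X.T @ grad) / len(y)
            self.b -= lr * grad.mean()

    def predict(self, X: np.ndarray) -> np.ndarray:
        return ((X @ self.w + self.b) > 0).astype(int)

# Hypothetical stand-in for pretrained frame features: (videos, T frames, D dims).
# In the actual pipeline these would come from an encoder such as CLIP.
rng = np.random.default_rng(1)
D = 512
real = rng.normal(loc=0.2, size=(50, 16, D))    # label 0: real videos
fake = rng.normal(loc=-0.2, size=(50, 16, D))   # label 1: generated videos
X = np.stack([temporal_pool(v) for v in np.concatenate([real, fake])])
y = np.array([0] * 50 + [1] * 50)

det = LinearDetector(D)
det.fit(X, y)
acc = (det.predict(X) == y).mean()
```

Because the encoder stays frozen, only `w` and `b` (a single D-dimensional weight vector and a bias) are trained, which is what keeps the method lightweight and model-agnostic.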
Problem

Research questions and friction points this paper is trying to address.

Detect AI-generated videos beyond deepfakes
Distinguish real from generated generic video content
Combat misinformation using pre-trained visual models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages pre-trained visual models
Extracts features for detection
Uses linear classification layer