Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model

📅 2024-04-08

📈 Citations: 0

✨ Influential: 0

career value

210K/year

🤖 AI Summary

To address poor generalization to unseen deepfake types, heavy reliance on labeled data, and insufficient model robustness in video deepfake detection, this paper proposes a lightweight side-decoder network leveraging the CLIP image encoder. Methodologically, we introduce a novel spatiotemporal feature decoding architecture that jointly models inter-frame temporal dynamics and intra-frame spatial cues, augmented by a facial component guidance (FCG) mechanism that employs attention to focus on discriminative facial regions. Compared to state-of-the-art approaches, our method significantly improves generalization: it achieves SOTA cross-dataset generalization performance on multi-source deepfake benchmarks, reduces training data requirements by 40%, cuts model parameters by 35%, and boosts accuracy under adversarial perturbations by 12%. Our core contributions are the first introduction of a side-network decoding paradigm and the FCG-guided strategy—jointly enabling high efficiency, strong robustness, and effective cross-domain adaptability.

Technology Category

Application Category

📝 Abstract

Generative models have enabled the creation of highly realistic facial-synthetic images, raising significant concerns due to their potential for misuse. Despite rapid advancements in the field of deepfake detection, developing efficient approaches to leverage foundation models for improved generalizability to unseen forgery samples remains challenging. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues using the CLIP image encoder for generalized video-based Deepfake detection. Additionally, we introduce Facial Component Guidance (FCG) to enhance spatial learning generalizability by encouraging the model to focus on key facial regions. By leveraging the generic features of a vision-language foundation model, our approach demonstrates promising generalizability on challenging Deepfake datasets while also exhibiting superiority in training data efficiency, parameter efficiency, and model robustness.

Problem

Research questions and friction points this paper is trying to address.

Improve generalizability of deepfake detection using foundation models

Enhance spatial learning by focusing on key facial regions

Develop efficient, robust models for detecting unseen deepfake samples

Innovation

Methods, ideas, or system contributions that make the work stand out.

Side-network decoder extracts spatial-temporal cues

Facial Component Guidance enhances spatial learning

Leverages vision-language model for generalizability

🔎 Similar Papers

No similar papers found.