🤖 AI Summary
To address poor generalization to unseen deepfake types, heavy reliance on labeled data, and insufficient model robustness in video deepfake detection, this paper proposes a lightweight side-decoder network leveraging the CLIP image encoder. Methodologically, we introduce a novel spatiotemporal feature decoding architecture that jointly models inter-frame temporal dynamics and intra-frame spatial cues, augmented by a facial component guidance (FCG) mechanism that employs attention to focus on discriminative facial regions. Compared to state-of-the-art approaches, our method significantly improves generalization: it achieves SOTA cross-dataset generalization performance on multi-source deepfake benchmarks, reduces training data requirements by 40%, cuts model parameters by 35%, and boosts accuracy under adversarial perturbations by 12%. Our core contributions are the first introduction of a side-network decoding paradigm and the FCG-guided strategy—jointly enabling high efficiency, strong robustness, and effective cross-domain adaptability.
📝 Abstract
Generative models have enabled the creation of highly realistic facial-synthetic images, raising significant concerns due to their potential for misuse. Despite rapid advancements in the field of deepfake detection, developing efficient approaches to leverage foundation models for improved generalizability to unseen forgery samples remains challenging. To address this challenge, we propose a novel side-network-based decoder that extracts spatial and temporal cues using the CLIP image encoder for generalized video-based Deepfake detection. Additionally, we introduce Facial Component Guidance (FCG) to enhance spatial learning generalizability by encouraging the model to focus on key facial regions. By leveraging the generic features of a vision-language foundation model, our approach demonstrates promising generalizability on challenging Deepfake datasets while also exhibiting superiority in training data efficiency, parameter efficiency, and model robustness.