🤖 AI Summary
This paper addresses the challenges of detecting partially manipulated facial deepfakes—such as fine-grained local feature edits—and their poor generalization across datasets and forgery methods. The authors propose a lightweight, efficient detection method built upon CLIP-ViT-L/14. The key contributions are threefold: (1) the first systematic investigation of CLIP’s visual encoder for fine-grained facial manipulation detection; (2) LN-tuning, a parameter-efficient fine-tuning strategy, combined with L2 normalization and hyperspherical metric learning regularization, which significantly improves zero-shot cross-dataset (Celeb-DF-v2, DFDC, FFIW) and cross-method generalization; and (3) a face-adaptive preprocessing module that enhances local sensitivity. By tuning only 0.1% of the model parameters, the approach achieves accuracy on par with or surpassing far more complex state-of-the-art methods. The implementation is publicly available.
📝 Abstract
This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
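The two central ingredients described above, LN-tuning (updating only the LayerNorm affine parameters of a frozen visual encoder) and L2 normalization of embeddings onto the unit hypersphere, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's implementation: a small `nn.TransformerEncoder` stands in for CLIP's ViT-L/14 visual encoder, which is not loaded here, and the function names are hypothetical.

```python
# Hedged sketch of LN-tuning. A small transformer encoder is used as an
# illustrative stand-in for CLIP's ViT-L/14 visual encoder; the real method
# would apply the same freezing pattern to the pre-trained CLIP weights.
import torch
import torch.nn as nn
import torch.nn.functional as F

def apply_ln_tuning(model: nn.Module) -> float:
    """Freeze all parameters except LayerNorm weights and biases.
    Returns the fraction of parameters left trainable."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# Stand-in encoder (the paper fine-tunes CLIP ViT-L/14 instead).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
frac = apply_ln_tuning(encoder)
print(f"trainable fraction: {frac:.4%}")  # well under 1% of all parameters

# L2-normalize embeddings so they lie on the unit hypersphere, the setting
# in which the hyperspherical metric-learning regularization operates.
emb = torch.randn(2, 256)
emb = F.normalize(emb, dim=-1)
```

With all attention and MLP weights frozen, only the tiny LayerNorm affine parameters receive gradients, which is what keeps the trainable-parameter budget near the 0.1% reported in the paper.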