Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of detecting partially manipulated facial deepfakes—fine-grained edits to local facial features—which generalize poorly across datasets and forgery methods. The authors propose a lightweight, efficient detection method built on CLIP's ViT-L/14 visual encoder. The key contributions are threefold: (1) the first systematic investigation of CLIP's visual encoder for fine-grained facial manipulation detection; (2) LN-tuning—a parameter-efficient fine-tuning strategy—combined with L2 normalization and hyperspherical metric-learning regularization, which significantly improves zero-shot cross-dataset (Celeb-DF-v2, DFDC, FFIW) and cross-method generalization; and (3) a face-adaptive preprocessing module that enhances sensitivity to local manipulations. By tuning only about 0.1% of the model's parameters, the approach matches or surpasses much more complex state-of-the-art methods. The implementation is publicly available.
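The LN-tuning strategy from the summary—freezing the backbone and updating only the LayerNorm affine parameters—can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' released code: the toy `Block` stands in for a real ViT block (in practice CLIP's ViT-L/14 encoder would be loaded), and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block; in practice this would be a block
# of CLIP's ViT-L/14 visual encoder (hypothetical simplification).
class Block(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)   # placeholder for self-attention
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)    # placeholder for the MLP

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

def ln_tune(model: nn.Module) -> None:
    """LN-tuning: freeze all parameters, then re-enable gradients
    only for LayerNorm affine parameters (weight and bias)."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True

model = nn.Sequential(*[Block() for _ in range(4)])
ln_tune(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

With a full ViT-L/14 backbone the LayerNorm parameters are a tiny fraction of the total, which is how the reported ~0.1% trainable-parameter budget arises; in this toy model the fraction is larger simply because the placeholder layers are small.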

📝 Abstract
This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
Problem

Research questions and friction points this paper is trying to address.

Detecting partially manipulated facial deepfakes with subtle alterations
Leveraging CLIP for generalizable deepfake detection across diverse datasets
Using parameter-efficient fine-tuning to preserve pre-trained knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes CLIP's ViT-L/14 visual encoder
Employs parameter-efficient fine-tuning (PEFT)
Enhances generalization with regularization strategies
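The regularization strategies listed above—L2 normalization and metric learning on a hyperspherical manifold—can be illustrated with a minimal PyTorch sketch. The function names and the margin-based pairwise loss below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hypersphere_embed(features: torch.Tensor) -> torch.Tensor:
    # L2-normalize encoder features, projecting them onto the unit
    # hypersphere so that distances reduce to cosine similarity.
    return F.normalize(features, p=2, dim=-1)

def pairwise_cosine_loss(emb: torch.Tensor, labels: torch.Tensor,
                         margin: float = 0.5) -> torch.Tensor:
    # Hypothetical contrastive-style regularizer on the hypersphere:
    # pull same-class (real/fake) embeddings together, push different
    # classes apart beyond a cosine-similarity margin.
    sim = emb @ emb.t()                       # cosine similarity (unit-norm inputs)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = (1 - sim)[same].mean()              # same class -> similarity toward 1
    neg = F.relu(sim - margin)[~same].mean()  # different class -> below margin
    return pos + neg

emb = hypersphere_embed(torch.randn(8, 16))
labels = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
loss = pairwise_cosine_loss(emb, labels)
```

Constraining embeddings to the unit hypersphere is a common way to stabilize metric learning, since it removes the degree of freedom in feature magnitude and leaves only angular separation to optimize.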