Unlocking the Hidden Potential of CLIP in Generalizable Deepfake Detection

📅 2025-03-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenge of detecting partially manipulated facial deepfakes—fine-grained edits to local facial features—which generalize poorly across datasets and forgery methods. The authors propose a lightweight, efficient detection method built on CLIP's ViT-L/14 visual encoder. The key contributions are threefold: (1) the first systematic investigation of CLIP's visual encoder for fine-grained facial manipulation detection; (2) LN-tuning—a parameter-efficient fine-tuning strategy—combined with L2 normalization and hyperspherical metric-learning regularization, which significantly improves zero-shot cross-dataset (Celeb-DF-v2, DFDC, FFIW) and cross-method generalization; and (3) a face-adaptive preprocessing module that enhances sensitivity to local manipulations. By tuning only about 0.1% of the model's parameters, the approach matches or surpasses much more complex state-of-the-art methods. The implementation is publicly available.
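The LN-tuning strategy from the summary—freezing the backbone and updating only the LayerNorm affine parameters—can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' released code: the toy `Block` stands in for a real ViT block (in practice CLIP's ViT-L/14 encoder would be loaded), and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer block; in practice this would be a block
# of CLIP's ViT-L/14 visual encoder (hypothetical simplification).
class Block(nn.Module):
    def __init__(self, dim: int = 16):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim)   # placeholder for self-attention
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Linear(dim, dim)    # placeholder for the MLP

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))

def ln_tune(model: nn.Module) -> None:
    """LN-tuning: freeze all parameters, then re-enable gradients
    only for LayerNorm affine parameters (weight and bias)."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.LayerNorm):
            for p in m.parameters():
                p.requires_grad = True

model = nn.Sequential(*[Block() for _ in range(4)])
ln_tune(model)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

With a full ViT-L/14 backbone the LayerNorm parameters are a tiny fraction of the total, which is how the reported ~0.1% trainable-parameter budget arises; in this toy model the fraction is larger simply because the placeholder layers are small.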

📝 Abstract
This paper tackles the challenge of detecting partially manipulated facial deepfakes, which involve subtle alterations to specific facial features while retaining the overall context, posing a greater detection difficulty than fully synthetic faces. We leverage the Contrastive Language-Image Pre-training (CLIP) model, specifically its ViT-L/14 visual encoder, to develop a generalizable detection method that performs robustly across diverse datasets and unknown forgery techniques with minimal modifications to the original model. The proposed approach utilizes parameter-efficient fine-tuning (PEFT) techniques, such as LN-tuning, to adjust a small subset of the model's parameters, preserving CLIP's pre-trained knowledge and reducing overfitting. A tailored preprocessing pipeline optimizes the method for facial images, while regularization strategies, including L2 normalization and metric learning on a hyperspherical manifold, enhance generalization. Trained on the FaceForensics++ dataset and evaluated in a cross-dataset fashion on Celeb-DF-v2, DFDC, FFIW, and others, the proposed method achieves competitive detection accuracy comparable to or outperforming much more complex state-of-the-art techniques. This work highlights the efficacy of CLIP's visual encoder in facial deepfake detection and establishes a simple, powerful baseline for future research, advancing the field of generalizable deepfake detection. The code is available at: https://github.com/yermandy/deepfake-detection
Problem

Research questions and friction points this paper is trying to address.

Detecting partially manipulated facial deepfakes with subtle alterations
Leveraging CLIP for generalizable deepfake detection across diverse datasets
Using parameter-efficient fine-tuning to preserve pre-trained knowledge
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes CLIP's ViT-L/14 visual encoder
Employs parameter-efficient fine-tuning (PEFT)
Enhances generalization with regularization strategies
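The regularization strategies listed above—L2 normalization and metric learning on a hyperspherical manifold—can be illustrated with a minimal PyTorch sketch. The function names and the margin-based pairwise loss below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def hypersphere_embed(features: torch.Tensor) -> torch.Tensor:
    # L2-normalize encoder features, projecting them onto the unit
    # hypersphere so that distances reduce to cosine similarity.
    return F.normalize(features, p=2, dim=-1)

def pairwise_cosine_loss(emb: torch.Tensor, labels: torch.Tensor,
                         margin: float = 0.5) -> torch.Tensor:
    # Hypothetical contrastive-style regularizer on the hypersphere:
    # pull same-class (real/fake) embeddings together, push different
    # classes apart beyond a cosine-similarity margin.
    sim = emb @ emb.t()                       # cosine similarity (unit-norm inputs)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    pos = (1 - sim)[same].mean()              # same class -> similarity toward 1
    neg = F.relu(sim - margin)[~same].mean()  # different class -> below margin
    return pos + neg

emb = hypersphere_embed(torch.randn(8, 16))
labels = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
loss = pairwise_cosine_loss(emb, labels)
```

Constraining embeddings to the unit hypersphere is a common way to stabilize metric learning, since it removes the degree of freedom in feature magnitude and leaves only angular separation to optimize.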