Standing on the Shoulders of Giants: Reprogramming Visual-Language Model for General Deepfake Detection

📅 2024-09-04
🏛️ Proceedings of the AAAI Conference on Artificial Intelligence
📈 Citations: 2
Influential: 1
🤖 AI Summary
Deepfake detection models generalize poorly across datasets and generative models. To address this, the paper proposes an input-level reprogramming method for vision-language models (VLMs) that requires no fine-tuning of the model's internal parameters. The approach jointly optimizes learnable visual perturbations applied to CLIP's image inputs and introduces sample-specific dynamic text prompts driven by face embeddings, enabling adaptive semantic alignment with forged content. The authors present this as the first input-reprogramming paradigm for VLM-based deepfake detection, eliminating reliance on supervised fine-tuning. On the challenging cross-dataset FF++ → Wild-Deepfake benchmark, the method achieves over 88% AUC, significantly outperforming prior approaches. Crucially, it incurs negligible parameter overhead (orders of magnitude fewer trainable parameters than fine-tuned baselines) while delivering both strong generalization and computational efficiency.
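The core idea, optimizing only an input perturbation against a frozen model whose outputs are read off as similarities to class prompts, can be illustrated with a toy sketch. This is not the paper's code: a frozen random linear map stands in for CLIP's image encoder, two fixed vectors stand in for the "real"/"fake" text-prompt embeddings, and gradient descent updates only the input perturbation `delta`.

```python
import numpy as np

def train_reprogramming(n_steps=200, lr=0.005, seed=0):
    """Toy input-reprogramming loop: the encoder and prompt anchors stay
    frozen; only a universal input perturbation `delta` is learned."""
    rng = np.random.default_rng(seed)
    d_in, d_emb, n = 16, 8, 32

    W = rng.normal(size=(d_emb, d_in))      # frozen "image encoder" weights
    t = rng.normal(size=(2, d_emb))         # frozen "text prompt" anchors: [real, fake]
    X = rng.normal(size=(n, d_in))          # toy input images (flattened)
    y = rng.integers(0, 2, size=n)          # 0 = real, 1 = fake

    delta = np.zeros(d_in)                  # the only trainable parameters
    losses = []
    for _ in range(n_steps):
        Z = (X + delta) @ W.T               # frozen forward pass on perturbed input
        logits = Z @ t.T                    # similarity to the two prompt anchors
        logits -= logits.max(axis=1, keepdims=True)
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)   # softmax over {real, fake}
        losses.append(-np.log(P[np.arange(n), y] + 1e-12).mean())

        G = P.copy()
        G[np.arange(n), y] -= 1.0           # dLoss/dlogits (softmax cross-entropy)
        dZ = G @ t / n                      # backprop through the similarity head
        delta -= lr * (dZ @ W).sum(axis=0)  # gradient step on the input only
    return losses

losses = train_reprogramming()
```

Because the loss is convex in `delta` (the logits are linear in the input), the loop reliably lowers the training loss while the "model" itself never changes; the paper applies the same principle at the scale of CLIP, adding sample-adaptive text prompts on top.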

📝 Abstract
The proliferation of deepfake faces poses huge potential negative impacts on our daily lives. Despite substantial advancements in deepfake detection in recent years, the generalizability of existing methods against forgeries from unseen datasets or created by emerging generative models remains constrained. In this paper, inspired by the zero-shot advantages of Vision-Language Models (VLMs), we propose a novel approach that repurposes a well-trained VLM for general deepfake detection. Motivated by the model reprogramming paradigm that manipulates the model prediction via input perturbations, our method can reprogram a pre-trained VLM (e.g., CLIP) solely by manipulating its input, without tuning the inner parameters. First, learnable visual perturbations are used to refine feature extraction for deepfake detection. Then, we exploit information from face embeddings to create sample-level adaptive text prompts, improving the performance. Extensive experiments on several popular benchmark datasets demonstrate that (1) the cross-dataset and cross-manipulation performance of deepfake detection can be significantly and consistently improved (e.g., over 88% AUC in the cross-dataset setting from FF++ to Wild-Deepfake); (2) these superior performances are achieved with fewer trainable parameters, making it a promising approach for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Improving generalizability of deepfake detection across unseen datasets
Reprogramming Vision-Language Models for zero-shot deepfake detection
Enhancing detection performance with minimal trainable parameters
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reprogramming Vision-Language Model for detection
Input perturbations without tuning inner parameters
Sample-level adaptive text prompts improve performance