AdaptPrompt: Parameter-Efficient Adaptation of VLMs for Generalizable Deepfake Detection

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the poor generalization of deepfake detection—particularly across diverse generative models—this paper proposes AdaptPrompt, a parameter-efficient generalization framework built upon a frozen CLIP backbone. It jointly optimizes learnable textual prompts and lightweight visual adapters, and introduces a novel texture-aware pruning strategy that removes the final layer of the visual encoder to enhance modeling of high-frequency forgery artifacts. The authors also establish Diff-Gen, the first large-scale benchmark dedicated to diffusion-model-generated forgeries, enabling systematic evaluation across GANs, diffusion models, and commercial tools. AdaptPrompt achieves state-of-the-art performance on 25 heterogeneous test sets using only 320 samples per domain, demonstrating strong cross-model generalization. Moreover, it supports closed-set generator architecture attribution. Key innovations include (i) a multimodal prompt–adapter co-optimization paradigm under frozen large-model constraints, and (ii) a texture-aware pruning strategy for improved forgery localization.

📝 Abstract
Recent advances in image generation have led to the widespread availability of highly realistic synthetic media, increasing the difficulty of reliable deepfake detection. A key challenge is generalization, as detectors trained on a narrow class of generators often fail when confronted with unseen models. In this work, we address the pressing need for generalizable detection by leveraging large vision-language models, specifically CLIP, to identify synthetic content across diverse generative techniques. First, we introduce Diff-Gen, a large-scale benchmark dataset comprising 100k diffusion-generated fakes that capture broad spectral artifacts unlike traditional GAN datasets. Models trained on Diff-Gen demonstrate stronger cross-domain generalization, particularly on previously unseen image generators. Second, we propose AdaptPrompt, a parameter-efficient transfer learning framework that jointly learns task-specific textual prompts and visual adapters while keeping the CLIP backbone frozen. We further show via layer ablation that pruning the final transformer block of the vision encoder enhances the retention of high-frequency generative artifacts, significantly boosting detection accuracy. Our evaluation spans 25 challenging test sets, covering synthetic content generated by GANs, diffusion models, and commercial tools, establishing a new state-of-the-art in both standard and cross-domain scenarios. We further demonstrate the framework's versatility through few-shot generalization (using as few as 320 images) and source attribution, enabling the precise identification of generator architectures in closed-set settings.
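The layer-ablation finding above — that dropping the final transformer block of the vision encoder preserves more high-frequency generative artifacts — can be illustrated with a toy sketch. This is not the paper's implementation: the 12-block residual "encoder" below is a hypothetical stand-in for a frozen CLIP-style vision backbone, and pruning simply means stopping one block early and taking penultimate-layer features.

```python
import numpy as np

DIM = 8  # toy feature width

def make_block(seed):
    # A frozen residual block with fixed random weights.
    W = np.random.default_rng(seed).standard_normal((DIM, DIM)) * 0.1
    def block(x):
        return x + np.tanh(x @ W)
    return block

blocks = [make_block(s) for s in range(12)]  # a 12-layer "encoder"

def encode(x, prune_last=False):
    # prune_last=True truncates the stack before the final block,
    # so the output comes from the penultimate layer.
    layers = blocks[:-1] if prune_last else blocks
    for f in layers:
        x = f(x)
    return x

x = np.random.default_rng(0).standard_normal((1, DIM))
full = encode(x)                      # standard final-layer features
pruned = encode(x, prune_last=True)   # penultimate-layer features
```

The intuition, per the abstract, is that the final block specializes features toward CLIP's original semantic objective, while earlier layers retain lower-level texture detail that forgery detection relies on.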
Problem

Research questions and friction points this paper is trying to address.

Generalizable deepfake detection across diverse generative models
Parameter-efficient adaptation of vision-language models for detection
Enhancing detection accuracy by retaining high-frequency generative artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale diffusion dataset captures broad artifacts
Parameter-efficient prompt and adapter learning framework
Pruning vision encoder enhances high-frequency artifact retention
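The prompt-and-adapter idea from the list above can be sketched in a few lines. Everything here is a simplified assumption, not the paper's code: the image feature is a random stand-in for a frozen CLIP visual embedding, the "prompts" are two learnable class embeddings standing in for learned text prompts, and the adapter is a generic residual bottleneck. Only the adapter weights and prompt embeddings would be trained; the backbone stays frozen.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 16  # toy embedding width

# Frozen image feature (stand-in for a CLIP visual-encoder output).
img_feat = rng.standard_normal(DIM)

# Learnable parameters in this sketch:
#   1) a residual bottleneck adapter on the visual side, and
#   2) two class "prompt" embeddings, e.g. [real, fake].
W_down = rng.standard_normal((DIM, 4)) * 0.1
W_up = rng.standard_normal((4, DIM)) * 0.1
prompts = rng.standard_normal((2, DIM))

def adapt(v):
    # Residual bottleneck adapter: v + up(relu(down(v))).
    return v + np.maximum(v @ W_down, 0.0) @ W_up

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Classify by cosine similarity between the adapted image feature
# and each class prompt, as in CLIP-style zero/few-shot scoring.
v = adapt(img_feat)
logits = np.array([cosine(v, p) for p in prompts])
pred = int(np.argmax(logits))  # 0 = "real", 1 = "fake" in this toy setup
```

In the real framework, both the prompt embeddings and the adapter would be optimized jointly on a small number of samples (the paper reports as few as 320 per domain), which is what makes the adaptation parameter-efficient.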
🔎 Similar Papers
Yichen Jiang (Apple AI/ML): NLP, AI, Machine Learning
Mohammed Talha Alam (MBZUAI, UAE)
Sohail Ahmed Khan (University of Bergen, Norway)
Duc-Tien Dang-Nguyen (University of Bergen): Multimedia Forensics, Multimedia Retrieval, Lifelogging, Misinformation, Multimedia Verification
Fakhri Karray (University of Waterloo, Canada)