Unlocking the Capabilities of Vision-Language Models for Generalizable and Explainable Deepfake Detection

📅 2025-03-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) suffer from misalignment between their learned knowledge and evolving deepfake patterns, resulting in limited generalizability and poor interpretability for deepfake detection. To address these limitations, we propose a knowledge-guided VLM unlocking paradigm comprising three key innovations: (1) a knowledge-enhanced embedding module that aligns forgery-specific features via semantic grounding; (2) a multimodal prompt-tuning framework supporting multi-turn interactive reasoning; and (3) an iterative evidence-reasoning mechanism that synergistically couples VLMs with large language models (LLMs). Our approach integrates contrastive learning, prompt tuning, and cross-modal alignment to jointly optimize visual and linguistic representations. Extensive experiments demonstrate state-of-the-art performance across five major benchmarks—FF++, CDF2, DFD, DFDCP, and DFDC—while achieving strong cross-domain generalization, pixel-level localization accuracy, and natural-language explanations for detection decisions.
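The knowledge-enhanced alignment step can be pictured as standard contrastive learning between visual forgery features and external manipulation-knowledge embeddings. The sketch below is a minimal, assumed implementation (an InfoNCE-style loss with paired rows as positives); the paper's actual loss, encoders, and temperature are not specified here.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(image_feats, knowledge_embeds, temperature=0.07):
    """Contrastive (InfoNCE) loss pulling each image feature toward its
    matching manipulation-knowledge embedding. Row i of each matrix is
    assumed to form a positive pair; all other rows act as negatives."""
    img = l2_normalize(image_feats)
    txt = l2_normalize(knowledge_embeds)
    logits = img @ txt.T / temperature              # pairwise similarities
    # cross-entropy over rows, positives on the diagonal
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(logits))
    return -log_probs[idx, idx].mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
loss_aligned = info_nce_loss(feats, feats)                 # perfect pairs
loss_random = info_nce_loss(feats, rng.normal(size=(4, 8)))
print(loss_aligned < loss_random)  # aligned pairs yield the lower loss
```

Minimizing such a loss is one plausible way the semantic space of the VLM could be pulled toward forgery-specific evidence.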

📝 Abstract
Current vision-language models (VLMs) have demonstrated remarkable capabilities in understanding multimodal data, but their potential remains underexplored for deepfake detection due to the misalignment between their knowledge and forensic patterns. To this end, we present a novel paradigm that unlocks VLMs' potential capabilities through three components: (1) a knowledge-guided forgery adaptation module that aligns the VLM's semantic space with forensic features through contrastive learning with external manipulation knowledge; (2) a multimodal prompt-tuning framework that jointly optimizes visual-textual embeddings for both localization and explainability; (3) an iterative refinement strategy enabling multi-turn dialogue for evidence-based reasoning. Our framework includes a VLM-based Knowledge-guided Forgery Detector (KFD), a VLM image encoder, and a large language model (LLM). The VLM image encoder extracts visual prompt embeddings from images, while the LLM receives visual and question prompt embeddings for inference. The KFD computes correlations between image features and pristine/deepfake class embeddings, enabling both forgery classification and localization. The outputs of these components are used to construct forgery prompt embeddings, which we feed into the LLM to generate textual detection responses that assist judgment. Extensive experiments on multiple benchmarks, including FF++, CDF2, DFD, DFDCP, and DFDC, demonstrate that our scheme surpasses state-of-the-art methods in generalization performance while also supporting multi-turn dialogue capabilities.
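The KFD's correlation step can be sketched as comparing per-patch visual features against pristine/deepfake class embeddings: per-patch probabilities give a localization map, and pooling them gives the image-level verdict. The function and pooling below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def kfd_scores(patch_feats, class_embeds, temperature=0.07):
    """Correlate patch features with pristine/deepfake class embeddings.
    patch_feats: (H*W, D) visual tokens; class_embeds: (2, D).
    Returns per-patch class probabilities (a coarse localization map)
    and a pooled image-level prediction (classification)."""
    sims = l2_normalize(patch_feats) @ l2_normalize(class_embeds).T  # (HW, 2)
    logits = sims / temperature
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    image_level = probs.mean(axis=0)  # average-pool patches for the verdict
    return probs, image_level

rng = np.random.default_rng(1)
pristine, fake = rng.normal(size=(2, 16))
patches = np.vstack([np.tile(pristine, (6, 1)),   # mostly pristine regions
                     np.tile(fake, (2, 1))])      # two manipulated patches
patch_map, verdict = kfd_scores(patches, np.stack([pristine, fake]))
print(verdict.argmax())  # 0 -> pristine dominates at the image level
```

In this toy setup the last two rows of `patch_map` light up for the deepfake class, which is the sense in which correlation scores double as a localization signal.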
Problem

Research questions and friction points this paper is trying to address.

Enhance deepfake detection using vision-language models.
Align VLM semantic space with forensic features.
Enable explainable and generalizable deepfake detection.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge-guided forgery adaptation aligns semantic space.
Multi-modal prompt tuning optimizes visual-textual embeddings.
Iterative refinement enables evidence-based reasoning dialogue.
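The iterative refinement idea amounts to a dialogue loop in which each turn's question and answer are appended to a shared history that conditions the next turn. This is a minimal structural sketch with hypothetical names (`DialogueState`, `mock_llm`); the paper's actual prompting interface is not given here.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    """Accumulates (question, answer) turns so later queries can
    reason over previously gathered evidence."""
    history: list = field(default_factory=list)

    def ask(self, question, answer_fn):
        answer = answer_fn(question, self.history)
        self.history.append((question, answer))
        return answer

def mock_llm(question, history):
    # stand-in for the LLM: in the real system, forgery prompt
    # embeddings and prior turns would condition this response
    return f"turn {len(history) + 1}: reasoning about '{question}'"

state = DialogueState()
state.ask("Is this face manipulated?", mock_llm)
state.ask("Which regions look inconsistent?", mock_llm)
print(len(state.history))  # 2 turns of evidence-based dialogue
```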