ExDDV: A New Dataset for Explainable Deepfake Detection in Video

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
As generative video quality improves, deepfakes are becoming increasingly difficult for human observers to distinguish from authentic content, while existing detectors remain error-prone and offer little interpretability. To address this, the authors propose ExDDV, the first dataset and benchmark for explainable deepfake detection in video, comprising roughly 5.4K real and deepfake videos manually annotated with natural-language descriptions that explain the artifacts and click-based labels that localize them. Several vision-language models are evaluated on ExDDV under various fine-tuning and in-context learning strategies. The experiments show that both supervision signals are indispensable: models trained with text and click supervision can both localize forged regions and generate coherent, human-aligned textual explanations. The dataset and code are publicly released.

📝 Abstract
The ever-growing realism and quality of generated videos makes it increasingly harder for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
Problem

Research questions and friction points this paper is trying to address.

Detecting deepfake videos with explainable AI methods
Addressing errors and lack of explainability in deepfake detectors
Providing a dataset for training explainable deepfake detection models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces ExDDV dataset for explainable deepfake detection
Uses text and click annotations for artifact localization
Evaluates vision-language models with fine-tuning strategies
Vlad Hondru
PhD Student, University of Bucharest, Romania; Machine Learning Engineer, eMAG
Machine Learning · Computer Vision · NLP · Diffusion Models
Eduard Hogea
West University of Timisoara, Romania
D. Onchis
West University of Timisoara, Romania
R. Ionescu
University of Bucharest, Romania