EDVD-LLaMA: Explainable Deepfake Video Detection via Multimodal Large Language Model Reasoning

📅 2025-10-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak interpretability and poor generalization of deepfake video detection, this paper proposes EDVD-LLaMA: (1) the first formal definition of the explainable deepfake video detection (EDVD) task; (2) a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) framework that integrates Spatio-Temporal Subtle Information Tokenization (ST-SIT) and facial feature hard constraints to enable pixel-level spatio-temporal localization and traceable reasoning; and (3) ER-FF++set, the first benchmark dataset explicitly designed for explainable deepfake video reasoning, supporting dual-supervised training of reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves state-of-the-art accuracy and robustness in cross-forgery-method and cross-dataset settings, significantly outperforming existing approaches while providing trustworthy explanations alongside high detection performance.

📝 Abstract
The rapid development of deepfake video technology has not only facilitated artistic creation but also made it easier to spread misinformation. Traditional deepfake video detection (DVD) methods suffer from a lack of transparency in their principles and insufficient generalization to cope with evolving forgery techniques. This highlights an urgent need for detectors that can both identify forged content and provide verifiable reasoning explanations. This paper proposes the explainable deepfake video detection (EDVD) task and designs EDVD-LLaMA, a multimodal large language model (MLLM) reasoning framework that provides traceable reasoning processes alongside accurate detection results and trustworthy explanations. Our approach first incorporates Spatio-Temporal Subtle Information Tokenization (ST-SIT) to extract and fuse global and local cross-frame deepfake features, providing rich spatio-temporal semantic input for MLLM reasoning. Second, we construct a Fine-grained Multimodal Chain-of-Thought (Fg-MCoT) mechanism, which introduces facial feature data as hard constraints during the reasoning process to achieve pixel-level spatio-temporal video localization, suppress hallucinated outputs, and enhance the reliability of the chain of thought. In addition, we build an Explainable Reasoning FF++ benchmark dataset (ER-FF++set), leveraging structured data to annotate videos and ensure quality control, thereby supporting dual supervision for reasoning and detection. Extensive experiments demonstrate that EDVD-LLaMA achieves outstanding performance and robustness in terms of detection accuracy, explainability, and its ability to handle cross-forgery-method and cross-dataset scenarios. Compared to previous DVD methods, it provides a more explainable and superior solution. The source code and dataset will be publicly available.
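The abstract describes the pipeline only at a high level. As a rough illustration of how spatio-temporal tokens might be produced and handed to an MLLM, the sketch below fuses a global per-frame feature stream with a local (face-crop) stream before projecting into the language model's embedding space. All class names, dimensions, and the fusion scheme are assumptions for illustration; this does not reproduce the paper's ST-SIT module.

```python
# Illustrative sketch only: SpatioTemporalTokenizer and its internals are
# placeholders, not the paper's ST-SIT implementation.
import torch
import torch.nn as nn

class SpatioTemporalTokenizer(nn.Module):
    """Turns a clip into a sequence of spatio-temporal tokens for an MLLM.

    Global branch: per-frame features from a vision backbone (assumed given).
    Local branch: features pooled over detected face crops (assumed given).
    Both streams are mixed across frames, concatenated, and projected into
    the language model's embedding space.
    """

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.temporal_mixer = nn.TransformerEncoderLayer(
            d_model=vision_dim, nhead=8, batch_first=True
        )
        self.proj = nn.Linear(2 * vision_dim, llm_dim)  # global ++ local -> LLM space

    def forward(self, global_feats: torch.Tensor, local_feats: torch.Tensor) -> torch.Tensor:
        # global_feats, local_feats: (batch, num_frames, vision_dim)
        g = self.temporal_mixer(global_feats)   # mix global information across frames
        l = self.temporal_mixer(local_feats)    # same mixer reused for the face stream
        fused = torch.cat([g, l], dim=-1)       # (batch, num_frames, 2 * vision_dim)
        return self.proj(fused)                 # tokens fed to the MLLM alongside text
```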
Problem

Research questions and friction points this paper is trying to address.

Detecting deepfake videos with transparent reasoning and trustworthy explanations
Addressing poor generalization of traditional detectors against evolving forgery techniques
Providing pixel-level spatio-temporal localization to suppress hallucinated outputs (see the sketch below)
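The last bullet refers to constraining the regions cited in the model's reasoning to the detected face. A minimal sketch of one way such a hard constraint could be enforced, assuming axis-aligned (x1, y1, x2, y2) boxes from an external face detector; the box format and clipping rule are illustrative, not the paper's exact mechanism.

```python
# Hedged illustration of the "hard constraint" idea: any spatial region the
# model cites is intersected with the detected face box, so the explanation
# cannot point outside the face. Box format and fallback rule are assumptions.
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

def constrain_to_face(predicted: Box, face: Box) -> Box:
    """Intersect a model-predicted forgery region with the detected face box."""
    x1 = max(predicted[0], face[0])
    y1 = max(predicted[1], face[1])
    x2 = min(predicted[2], face[2])
    y2 = min(predicted[3], face[3])
    if x2 <= x1 or y2 <= y1:        # no overlap: fall back to the face box itself
        return face
    return (x1, y1, x2, y2)
```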
Innovation

Methods, ideas, or system contributions that make the work stand out.

ST-SIT extracts and fuses global and local cross-frame deepfake features
Fg-MCoT imposes facial feature hard constraints for pixel-level spatio-temporal localization
ER-FF++set provides structured annotations for dual supervision of reasoning and detection (see the sketch after this list)
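A minimal sketch of the dual-supervision idea from the last bullet, assuming a token-level loss on annotated reasoning text plus a binary real/fake detection loss. The weighting term `lambda_det` and the tensor shapes are placeholders, not the paper's training recipe.

```python
# Sketch of dual supervision: language-modelling loss over generated reasoning
# text combined with a real/fake classification loss. Shapes and weighting are
# assumptions for illustration.
import torch
import torch.nn.functional as F

def dual_supervision_loss(
    reasoning_logits: torch.Tensor,   # (batch, seq_len, vocab_size) from the MLLM head
    reasoning_targets: torch.Tensor,  # (batch, seq_len) token ids of annotated explanations
    detection_logits: torch.Tensor,   # (batch, 2) real/fake scores
    detection_labels: torch.Tensor,   # (batch,) 0 = real, 1 = fake
    lambda_det: float = 1.0,
) -> torch.Tensor:
    # Supervise the chain-of-thought text token by token.
    lm_loss = F.cross_entropy(
        reasoning_logits.reshape(-1, reasoning_logits.size(-1)),
        reasoning_targets.reshape(-1),
        ignore_index=-100,  # ignore prompt / padding positions
    )
    # Supervise the final real/fake decision.
    det_loss = F.cross_entropy(detection_logits, detection_labels)
    return lm_loss + lambda_det * det_loss
```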
Haoran Sun
Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR
Chen Cai
School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore
Huiping Zhuang
Associate Professor, South China University of Technology
Continual Learning, Multi-Modal, Embodied AI, Large Model
Kong Aik Lee
The Hong Kong Polytechnic University, Hong Kong
Speaker and Spoken Language Recognition, Speech Processing, Digital Signal Processing, Subband
Lap-Pui Chau
The Hong Kong Polytechnic University
Visual Signal Processing
Yi Wang
Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR