Beyond Static Artifacts: A Forensic Benchmark for Video Deepfake Reasoning in Vision Language Models

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language models for deepfake detection: they often fail to capture temporal inconsistencies in videos because they reason insufficiently over dynamic cues. The study formulates deepfake detection as a multi-level visual-language reasoning task and introduces FAQ, a large-scale multiple-choice benchmark spanning three hierarchical levels: facial perception, temporal forgery localization, and forensic reasoning. Accompanying this benchmark is FAQ-IT, an instruction-tuning dataset designed to support end-to-end hierarchical training and evaluation through a multi-granularity temporal analysis framework integrated with a multiple-choice question-answering paradigm. Experiments demonstrate that models fine-tuned on FAQ-IT achieve state-of-the-art performance on both in-domain and cross-dataset deepfake detection tasks, while ablation studies confirm the critical role of the proposed benchmark in enhancing temporal reasoning capabilities.

📝 Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains a distinct challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence for final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve advanced performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate the impact of our key design choices, confirming that FAQ is the driving force behind the temporal reasoning capabilities of these VLMs.
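The abstract frames temporal deepfake analysis as a three-level multiple-choice task. As a minimal sketch of how such a benchmark might be scored, the snippet below computes per-level accuracy over predicted answer choices; the item schema and level names here are illustrative assumptions, not the paper's actual FAQ data format.

```python
# Hypothetical per-level scorer for a hierarchical multiple-choice benchmark.
# The fields "level", "answer", and "prediction" are assumed for illustration.
from collections import defaultdict

def score_by_level(items):
    """Return accuracy per benchmark level for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["level"]] += 1
        if item["prediction"] == item["answer"]:
            correct[item["level"]] += 1
    return {level: correct[level] / total[level] for level in total}

# Toy example with the paper's three levels (names assumed):
items = [
    {"level": "facial_perception",  "answer": "B", "prediction": "B"},
    {"level": "temporal_grounding", "answer": "C", "prediction": "A"},
    {"level": "temporal_grounding", "answer": "C", "prediction": "C"},
    {"level": "forensic_reasoning", "answer": "D", "prediction": "D"},
]
print(score_by_level(items))
```

Reporting accuracy per level, rather than a single aggregate, matches the paper's goal of separating static artifact perception from temporal localization and final forensic verdicts.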
Problem

Research questions and friction points this paper is trying to address.

video deepfake
temporal inconsistency
vision-language models
forensic reasoning
deepfake detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal deepfake reasoning
vision-language models
forensic benchmark
video deepfake detection
instruction tuning
Zheyuan Gu
Institute of Information Engineering, Chinese Academy of Sciences
Encrypted Traffic Analysis · Cybercrime

Qingsong Zhao
Tongji University
Machine Learning · Computer Vision

Yusong Wang
Tokyo Institute of Technology
Representation Learning · Affective Computing

Zhaohong Huang
Institute of Artificial Intelligence, China Telecom (TeleAI)

Xinqi Li
Peking University

Cheng Yuan
Associate Professor, School of Mathematics and Statistics, Central China Normal University
Computational Physics · Deep Learning

Jiaowei Shao
Institute of Artificial Intelligence, China Telecom (TeleAI)

Chi Zhang
Institute of Artificial Intelligence, China Telecom (TeleAI)

Xuelong Li
Institute of Artificial Intelligence, China Telecom (TeleAI)