🤖 AI Summary
This work addresses a key limitation of existing vision-language models for deepfake detection: they often fail to capture temporal inconsistencies in videos because they reason poorly about dynamic cues. The study formulates deepfake detection as a multi-level visual-language reasoning task and introduces FAQ, a large-scale multiple-choice benchmark spanning three hierarchical levels: facial perception, temporal forgery localization, and forensic reasoning. The benchmark is paired with FAQ-IT, an instruction-tuning dataset that supports end-to-end hierarchical training and evaluation by combining multi-granularity temporal analysis with a multiple-choice question-answering paradigm. Experiments show that models fine-tuned on FAQ-IT achieve state-of-the-art performance on both in-domain and cross-dataset deepfake detection, and ablation studies confirm that the proposed benchmark is critical to the improved temporal reasoning.
📝 Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains an open challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence into final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve state-of-the-art performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate our key design choices, confirming that FAQ drives the temporal reasoning capabilities of these VLMs.
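To make the multiple-choice formulation concrete, the sketch below shows what a hierarchical FAQ-style evaluation loop could look like. The item schema, field names, level labels, and the toy predictor are hypothetical illustrations under our own assumptions, not the paper's actual data format or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class FAQItem:
    # Hypothetical schema for one multiple-choice item; field names are illustrative.
    level: str      # e.g. "facial_perception", "temporal_grounding", "forensic_reasoning"
    question: str
    options: dict   # option letter -> option text
    answer: str     # gold option letter

def evaluate(items, predict):
    """Compute per-level accuracy under a multiple-choice paradigm."""
    correct, total = {}, {}
    for item in items:
        total[item.level] = total.get(item.level, 0) + 1
        if predict(item) == item.answer:
            correct[item.level] = correct.get(item.level, 0) + 1
    return {lvl: correct.get(lvl, 0) / n for lvl, n in total.items()}

# Toy items and a dummy predictor standing in for a real VLM call.
items = [
    FAQItem("facial_perception", "Which artifact appears on the face?",
            {"A": "blending boundary", "B": "none"}, "A"),
    FAQItem("temporal_grounding", "In which segment does the forgery occur?",
            {"A": "0-2s", "B": "4-6s"}, "B"),
]
always_a = lambda item: "A"
print(evaluate(items, always_a))  # {'facial_perception': 1.0, 'temporal_grounding': 0.0}
```

Scoring each hierarchy level separately, as here, is what lets a benchmark of this shape attribute failures to perception, localization, or reasoning rather than reporting a single opaque accuracy.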