🤖 AI Summary
This work addresses a key limitation of existing vision-language models for deepfake detection: they often fail to capture temporal inconsistencies in videos because they reason poorly about dynamic cues. The study formulates deepfake detection as a multi-level visual-language reasoning task and introduces FAQ, a large-scale multiple-choice benchmark spanning three hierarchical levels: facial perception, temporal forgery localization, and forensic reasoning. The benchmark is paired with FAQ-IT, an instruction-tuning dataset that supports end-to-end hierarchical training and evaluation by combining multi-granularity temporal analysis with a multiple-choice question-answering paradigm. Experiments show that models fine-tuned on FAQ-IT achieve state-of-the-art performance on both in-domain and cross-dataset deepfake detection, and ablation studies confirm that the proposed benchmark is critical to the improved temporal reasoning.
📝 Abstract
Current Vision-Language Models (VLMs) for deepfake detection excel at identifying spatial artifacts but overlook a critical dimension: temporal inconsistencies in video forgeries. Adapting VLMs to reason about these dynamic cues remains an open challenge. To bridge this gap, we propose Forensic Answer-Questioning (FAQ), a large-scale benchmark that formulates temporal deepfake analysis as a multiple-choice task. FAQ introduces a three-level hierarchy to progressively evaluate and equip VLMs with forensic capabilities: (1) Facial Perception, testing the ability to identify static visual artifacts; (2) Temporal Deepfake Grounding, requiring the localization of dynamic forgery artifacts across frames; and (3) Forensic Reasoning, challenging models to synthesize evidence into final authenticity verdicts. We evaluate a range of VLMs on FAQ and generate a corresponding instruction-tuning set, FAQ-IT. Extensive experiments show that models fine-tuned on FAQ-IT achieve state-of-the-art performance on both in-domain and cross-dataset detection benchmarks. Ablation studies further validate our key design choices, confirming that FAQ drives the temporal reasoning capabilities of these VLMs.
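To make the multiple-choice formulation concrete, the sketch below shows what a hierarchical FAQ-style evaluation loop could look like. The item schema, field names, level labels, and the toy predictor are hypothetical illustrations under our own assumptions, not the paper's actual data format or evaluation code.

```python
from dataclasses import dataclass

@dataclass
class FAQItem:
    # Hypothetical schema for one multiple-choice item; field names are illustrative.
    level: str      # e.g. "facial_perception", "temporal_grounding", "forensic_reasoning"
    question: str
    options: dict   # option letter -> option text
    answer: str     # gold option letter

def evaluate(items, predict):
    """Compute per-level accuracy under a multiple-choice paradigm."""
    correct, total = {}, {}
    for item in items:
        total[item.level] = total.get(item.level, 0) + 1
        if predict(item) == item.answer:
            correct[item.level] = correct.get(item.level, 0) + 1
    return {lvl: correct.get(lvl, 0) / n for lvl, n in total.items()}

# Toy items and a dummy predictor standing in for a real VLM call.
items = [
    FAQItem("facial_perception", "Which artifact appears on the face?",
            {"A": "blending boundary", "B": "none"}, "A"),
    FAQItem("temporal_grounding", "In which segment does the forgery occur?",
            {"A": "0-2s", "B": "4-6s"}, "B"),
]
always_a = lambda item: "A"
print(evaluate(items, always_a))  # {'facial_perception': 1.0, 'temporal_grounding': 0.0}
```

Scoring each hierarchy level separately, as here, is what lets a benchmark of this shape attribute failures to perception, localization, or reasoning rather than reporting a single opaque accuracy.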