🤖 AI Summary
Current multimodal large language models (MLLMs) for deepfake detection often conflate evidence generation with manipulation localization, leading to unreliable explanations and susceptibility to hallucination. This work proposes VIGIL, a facial-component-aware structured forensic framework that adopts a "plan-then-examine" paradigm: it first identifies suspicious regions using global cues and then injects region-specific forensic evidence through a stage-gated mechanism for focused analysis. By integrating three-stage progressive training, anatomical plausibility rewards, and component-level feature extraction, VIGIL ensures faithful and interpretable reasoning. Evaluated on the OmniFake benchmark and in cross-dataset settings, VIGIL consistently outperforms existing MLLM-based approaches and specialized detectors across all generalization levels, demonstrating superior robustness and explainability.
📄 Abstract
Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, current MLLM-based methods combine evidence generation and manipulation localization into a single reasoning step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Motivated by this observation, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical five-level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
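The plan-then-examine pipeline with stage-gated evidence injection can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all names (`plan_parts`, `examine`, `run_pipeline`), the facial-part list, and the 0.5 thresholds are hypothetical assumptions chosen for clarity.

```python
# Hypothetical sketch of VIGIL's plan-then-examine pipeline with stage-gated
# evidence injection. Function names, the part list, and thresholds are
# illustrative, not the paper's actual API.

FACIAL_PARTS = ["eyes", "nose", "mouth", "skin", "hairline"]

def plan_parts(global_cues):
    """Plan stage: select suspicious parts from global visual cues only.
    The evidence gate is closed here, so selection is driven purely by
    the model's own perception, not by external forensic signals."""
    return [p for p in FACIAL_PARTS if global_cues.get(p, 0.0) > 0.5]

def examine(part, part_evidence):
    """Examine stage: the gate opens, and part-level forensic evidence
    is injected only for the parts chosen during planning."""
    score = part_evidence.get(part, 0.0)
    return {"part": part, "verdict": "manipulated" if score > 0.5 else "clean"}

def run_pipeline(global_cues, part_evidence):
    planned = plan_parts(global_cues)                 # gate closed
    findings = [examine(p, part_evidence) for p in planned]  # gate open
    return planned, findings
```

The key design point illustrated here is the ordering constraint: `plan_parts` never sees `part_evidence`, mirroring the stage-gating that keeps part selection unbiased by the injected evidence.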