🤖 AI Summary
Current multimodal large language models (MLLMs) for deepfake detection often conflate evidence generation with manipulation localization, leading to unreliable explanations and susceptibility to hallucination. This work proposes VIGIL, a facial-component-aware structured forensic framework that adopts a "plan-then-examine" paradigm: it first identifies suspicious regions using global cues and then injects region-specific forensic evidence through a stage-gated mechanism for focused analysis. By integrating three-stage progressive training, anatomical plausibility rewards, and component-level feature extraction, VIGIL ensures faithful and interpretable reasoning. Evaluated on the OmniFake benchmark and in cross-dataset settings, VIGIL consistently outperforms existing MLLM-based approaches and specialized detectors across all generalization levels, demonstrating superior robustness and explainability.
📄 Abstract
Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, current MLLM-based methods combine evidence generation and manipulation localization into a single reasoning step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. Motivated by this observation, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice through a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence–conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical five-level benchmark where the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
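The plan-then-examine pipeline with stage-gated evidence injection can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all names (`plan_parts`, `examine`, `run_pipeline`), the facial-part list, and the 0.5 thresholds are hypothetical assumptions chosen for clarity.

```python
# Hypothetical sketch of VIGIL's plan-then-examine pipeline with stage-gated
# evidence injection. Function names, the part list, and thresholds are
# illustrative, not the paper's actual API.

FACIAL_PARTS = ["eyes", "nose", "mouth", "skin", "hairline"]

def plan_parts(global_cues):
    """Plan stage: select suspicious parts from global visual cues only.
    The evidence gate is closed here, so selection is driven purely by
    the model's own perception, not by external forensic signals."""
    return [p for p in FACIAL_PARTS if global_cues.get(p, 0.0) > 0.5]

def examine(part, part_evidence):
    """Examine stage: the gate opens, and part-level forensic evidence
    is injected only for the parts chosen during planning."""
    score = part_evidence.get(part, 0.0)
    return {"part": part, "verdict": "manipulated" if score > 0.5 else "clean"}

def run_pipeline(global_cues, part_evidence):
    planned = plan_parts(global_cues)                 # gate closed
    findings = [examine(p, part_evidence) for p in planned]  # gate open
    return planned, findings
```

The key design point illustrated here is the ordering constraint: `plan_parts` never sees `part_evidence`, mirroring the stage-gating that keeps part selection unbiased by the injected evidence.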