VIGIL: Part-Grounded Structured Reasoning for Generalizable Deepfake Detection

πŸ“… 2026-03-22
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current multimodal large language models (MLLMs) for deepfake detection often conflate evidence generation with manipulation localization, leading to unreliable explanations and susceptibility to hallucination. This work proposes VIGIL, a facial-component-aware structured forensic framework that adopts a plan-then-examine paradigm: it first identifies suspicious regions using global cues and then injects region-specific forensic evidence through a stage-gated mechanism for focused analysis. By integrating three-stage progressive training, anatomical plausibility rewards, and component-level feature extraction, VIGIL keeps its reasoning faithful and interpretable. Evaluated on the OmniFake benchmark and in cross-dataset settings, VIGIL consistently outperforms existing MLLM-based approaches and specialized detectors across all generalization levels, demonstrating superior robustness and explainability.

πŸ“ Abstract
Multimodal large language models (MLLMs) offer a promising path toward interpretable deepfake detection by generating textual explanations. However, the reasoning process of current MLLM-based methods combines evidence generation and manipulation localization into a single step. This combination blurs the boundary between faithful observations and hallucinated explanations, leading to unreliable conclusions. To address this, we present VIGIL, a part-centric structured forensic framework inspired by expert forensic practice that follows a plan-then-examine pipeline: the model first plans which facial parts warrant inspection based on global visual cues, then examines each part with independently sourced forensic evidence. A stage-gated injection mechanism delivers part-level forensic evidence only during examination, ensuring that part selection remains driven by the model's own perception rather than biased by external signals. We further propose a progressive three-stage training paradigm whose reinforcement learning stage employs part-aware rewards to enforce anatomical validity and evidence-conclusion coherence. To enable rigorous generalizability evaluation, we construct OmniFake, a hierarchical 5-level benchmark in which the model, trained on only three foundational generators, is progressively tested up to in-the-wild social-media data. Extensive experiments on OmniFake and cross-dataset evaluations demonstrate that VIGIL consistently outperforms both expert detectors and concurrent MLLM-based methods across all generalizability levels.
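The plan-then-examine pipeline with stage-gated evidence injection can be illustrated with a minimal sketch. All names here (`plan_parts`, `extract_part_evidence`, the 0.5/0.7 thresholds, and the dict-based inputs) are illustrative assumptions, not the paper's actual API; the point is only the control flow: the plan stage sees global cues alone, and part-level evidence is injected only in the examine stage.

```python
# Hypothetical sketch of VIGIL-style "plan-then-examine" with stage-gated
# evidence injection. Names and thresholds are assumptions for illustration.

FACIAL_PARTS = ["eyes", "nose", "mouth", "skin", "hair"]

def plan_parts(global_cues):
    """Plan stage: select suspicious parts from global cues only.
    No part-level evidence is visible here, so selection stays driven
    by the model's own perception rather than external signals."""
    return [p for p in FACIAL_PARTS if global_cues.get(p, 0.0) > 0.5]

def extract_part_evidence(image, part):
    """Stand-in for a component-level forensic feature extractor."""
    return {"part": part, "artifact_score": image.get(part, 0.0)}

def examine(image, parts):
    """Examine stage: evidence is injected only now (stage gating),
    one part at a time, yielding per-part verdicts."""
    findings = []
    for part in parts:
        evidence = extract_part_evidence(image, part)
        findings.append((part, evidence["artifact_score"] > 0.7))
    return findings

def detect(image, global_cues):
    """Full pipeline: plan, then examine; fake if any part is flagged."""
    suspicious = plan_parts(global_cues)
    findings = examine(image, suspicious)
    is_fake = any(flagged for _, flagged in findings)
    return is_fake, findings
```

Keeping `plan_parts` blind to `extract_part_evidence` is the essence of the stage gating: the planner cannot be biased by the forensic signal it has not yet seen.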
Problem

Research questions and friction points this paper is trying to address.

deepfake detection, multimodal large language models, reasoning reliability, generalizability, forensic analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

structured reasoning, part-grounded detection, stage-gated evidence injection, progressive training paradigm, deepfake generalizability
πŸ”Ž Similar Papers
No similar papers found.
Xinghan Li
ZJU
robotics, state estimation, embodied AI

Junhao Xu
Institute of Trustworthy Embodied AI, Fudan University; Shanghai Key Laboratory of Multimodal Embodied AI

Jingjing Chen
Fudan University
multimedia, computer vision, machine learning, pattern recognition