VLForgery Face Triad: Detection, Localization and Attribution via Multimodal Large Language Models

📅 2025-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based face generation poses significant challenges for deepfake detection, particularly in localizing forged regions, attributing generation methods, and explaining the underlying forgery mechanisms. Method: the paper proposes a fine-grained triad framework that integrates detection, localization, and attribution. It introduces VLF, the first vision-language forensic dataset for this task; an extrinsic-knowledge-guided prompting method, EkCot, which draws on knowledge of the image-generation pipeline for descriptive reasoning; and a low-level vision comparison mechanism that injects real-vs-fake differential features to drive multi-stage cascaded inference. The approach combines multimodal large language models (MLLMs), diffusion-generation analysis, knowledge-enhanced prompt engineering, and low-level visual discrepancy modeling. Contribution/Results: the method achieves state-of-the-art detection accuracy and, uniquely within a single unified framework, also delivers pixel-level forgery localization and generator-level attribution, improving both interpretability and practical utility for forensic analysis.

📝 Abstract
Faces synthesized by diffusion models (DMs) with high-quality and controllable attributes pose a significant challenge for Deepfake detection. Most state-of-the-art detectors only yield a binary decision, incapable of forgery localization, attribution of forgery methods, and providing analysis on the cause of forgeries. In this work, we integrate Multimodal Large Language Models (MLLMs) within DM-based face forensics, and propose a fine-grained analysis triad framework called VLForgery, that can 1) predict falsified facial images; 2) locate the falsified face regions subjected to partial synthesis; and 3) attribute the synthesis with specific generators. To achieve the above goals, we introduce VLF (Visual Language Forensics), a novel and diverse synthesis face dataset designed to facilitate rich interactions between Visual and Language modalities in MLLMs. Additionally, we propose an extrinsic knowledge-guided description method, termed EkCot, which leverages knowledge from the image generation pipeline to enable MLLMs to quickly capture image content. Furthermore, we introduce a low-level vision comparison pipeline designed to identify differential features between real and fake that MLLMs can inherently understand. These features are then incorporated into EkCot, enhancing its ability to analyze forgeries in a structured manner, following the sequence of detection, localization, and attribution. Extensive experiments demonstrate that VLForgery outperforms other state-of-the-art forensic approaches in detection accuracy, with additional potential for falsified region localization and attribution analysis.
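The abstract describes a structured analysis that follows the sequence detection, localization, attribution, with each stage conditioning the next. A minimal, hypothetical sketch of such a cascade is below; `run_mllm` is a stub standing in for the fine-tuned MLLM call (its prompts, stage names, and canned answers are illustrative assumptions, not the paper's actual interface).

```python
# Hypothetical sketch of a detect -> localize -> attribute cascade.
# run_mllm is a stub; the real system queries an MLLM trained on the VLF dataset.

def run_mllm(prompt: str, image_id: str) -> str:
    """Stub for a multimodal LLM call (canned answers for illustration only)."""
    canned = {
        "detect": "fake",
        "localize": "eyes, mouth",
        "attribute": "diffusion-inpainting generator",
    }
    stage = prompt.split(":", 1)[0]
    return canned[stage]

def forensic_triad(image_id: str) -> dict:
    """Cascaded inference: localization and attribution run only on a 'fake' verdict."""
    verdict = run_mllm("detect: is this face real or fake?", image_id)
    result = {"detection": verdict, "localization": None, "attribution": None}
    if verdict == "fake":
        result["localization"] = run_mllm("localize: which facial regions are forged?", image_id)
        result["attribution"] = run_mllm("attribute: which generator produced the forgery?", image_id)
    return result

print(forensic_triad("sample_001"))
```

The cascade structure mirrors the paper's stated ordering: a real verdict short-circuits the pipeline, so localization and attribution are only attempted for images flagged as falsified.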
Problem

Research questions and friction points this paper is trying to address.

Detect high-quality synthetic faces from diffusion models
Localize falsified regions in partially synthesized faces
Attribute specific generators used in face synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models for face forensics
VLF dataset enhances visual-language interaction
EkCot method leverages image generation knowledge
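The low-level vision comparison pipeline is said to surface differential features between real and fake images. One common low-level forensic cue of this kind is high-frequency residual energy, since diffusion-synthesized regions are often locally smoother than camera-captured ones. The sketch below is an illustrative stand-in under that assumption, not the paper's actual feature extractor.

```python
# Hedged sketch of a low-level real-vs-fake cue: high-frequency residual energy.
# Illustrative assumption only; not the paper's actual comparison pipeline.

def high_freq_residual(img):
    """Per-pixel difference from the 4-neighbour mean (pure Python, grayscale grid)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nbrs = []
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    nbrs.append(img[ny][nx])
            out[y][x] = img[y][x] - sum(nbrs) / len(nbrs)
    return out

def residual_energy(img):
    """Mean absolute residual; smoother (often synthetic) patches score lower."""
    r = high_freq_residual(img)
    return sum(abs(v) for row in r for v in row) / (len(img) * len(img[0]))

# A textured patch should carry more residual energy than a flat one.
noisy = [[(x * 7 + y * 13) % 5 for x in range(8)] for y in range(8)]
smooth = [[1 for _ in range(8)] for _ in range(8)]
print(residual_energy(noisy) > residual_energy(smooth))  # True
```

In the framework described above, features like these would be injected into the EkCot prompt so the MLLM can reason over signals it cannot reliably extract from raw pixels on its own.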
Authors
Xinan He, Nanchang University (MS student); interests: DeepFakes, Multimedia Forensics, AIGC Detection
Yue Zhou, Shenzhen University
Bing Fan, University of North Texas
Bin Li, Shenzhen University
Guopu Zhu, Harbin Institute of Technology
Feng Ding, Suzhou Laboratory; interests: Physics, Chemistry, Material Science