MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited generalization of existing methods for detecting and localizing face forgeries generated by diffusion models. Existing methods rely mainly on the image modality and make little use of fine-grained text and other multimodal cues. To overcome this, the authors propose a language-guided, multi-domain fine-grained vision-language reconstruction framework that jointly models forgery traces in the image and residual domains. The framework combines a fine-grained language Transformer, a multi-domain visual encoder-decoder, and a plug-and-play visual injection module that fuses the visual and language representations. The work is presented as the first to systematically address generalizable detection and localization of diffusion-based face forgeries, and it reports state-of-the-art performance in cross-generator, cross-forgery-type, and cross-dataset settings, with improved generalization and localization accuracy.

📝 Abstract

The swift advancement of photo-realistic face generation technology has sparked considerable concern across society and academia, underscoring the need for generalizable face forgery detection and localization methods. Prior works tend to capture face forgery patterns across multiple domains using only the image modality; other modalities, such as fine-grained text, have not been comprehensively investigated, which restricts the generalization capability of the models. In addition, prior works usually analyze facial images created by GANs and struggle to identify and localize those synthesized by diffusion models. To address these problems, we devise a novel multi-domain fine-grained vision-language reconstruction (MFVLR) model, which explores comprehensive and diverse visual forgery traces via language-guided face forgery representation learning to achieve generalizable diffusion-synthesized face forgery detection and localization (DFFDL). Specifically, we devise a fine-grained language transformer that learns general fine-grained language embeddings through language reconstruction. We propose a multi-domain vision encoder to capture general and complementary visual forgery patterns across the image and residual domains. A vision decoder is designed to reconstruct image appearance and achieve forgery localization. We also propose a plug-and-play vision injection module to enhance the interaction between the vision and language embeddings. Extensive experiments and visualizations demonstrate that our network outperforms the state of the art in settings such as cross-generator, cross-forgery, and cross-dataset evaluation.
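The abstract does not say how the residual domain is obtained. As a rough illustration only, the sketch below assumes one common choice: a high-pass residual computed as the image minus a blurred copy of itself, which suppresses smooth content and keeps local irregularities. The function names `box_blur` and `residual` are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: the paper's actual residual extraction is not
# described in the abstract. Here the "residual domain" is assumed to be a
# simple high-pass residual (image minus its 3x3 mean-filtered copy).

def box_blur(image):
    """Return a 3x3 mean-filtered copy of a 2D grayscale image (list of rows).

    Border pixels are averaged over the neighbours that exist.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += image[ny][nx]
                        count += 1
            out[y][x] = total / count
    return out


def residual(image):
    """High-pass residual: each pixel minus its local 3x3 mean."""
    blurred = box_blur(image)
    return [
        [pixel - mean for pixel, mean in zip(row, blurred_row)]
        for row, blurred_row in zip(image, blurred)
    ]


# A flat image has no high-frequency content, so its residual is zero.
flat = [[0.5] * 4 for _ in range(4)]

# The same image with one perturbed pixel, standing in for a local artifact.
spiky = [row[:] for row in flat]
spiky[1][2] = 1.0

flat_res = residual(flat)
spiky_res = residual(spiky)
```

In this toy case the residual of the flat image is zero everywhere, while the perturbed pixel produces a clear positive response. A multi-domain encoder of the kind the abstract describes would take both the original image and a residual map like this as complementary inputs.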

Problem

Research questions and friction points this paper is trying to address.

face forgery detection

diffusion synthesis

vision-language

generalization

multi-domain

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language reconstruction

diffusion face forgery

multi-domain representation