ViGText: Deepfake Image Detection with Vision-Language Model Explanations and Graph Neural Networks

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weak generalization and poor robustness of deepfake image detection, this paper proposes a novel method that integrates vision-language explanations with graph neural networks (GNNs). It leverages vision-language large models (VLLMs) to generate fine-grained image-text explanations, replacing labor-intensive manual annotations, and constructs a joint image-patch and text-description graph. The graph incorporates multi-scale patching and dual-path feature extraction (spatial and frequency domains) to enable context-aware multimodal reasoning. The approach improves both interpretability and detection of customized deepfakes: it achieves a 98.32% average F1 score under generalization evaluation (up 25.87 points), improves recall by 11.1% over competing detectors, and limits classification performance degradation to less than 4% under targeted adversarial attacks. These results mark substantial advances in both generalization and robustness.

📝 Abstract
The rapid rise of deepfake technology, which produces realistic but fraudulent digital content, threatens the authenticity of media. Traditional deepfake detection approaches often struggle with sophisticated, customized deepfakes, especially in terms of generalization and robustness against malicious attacks. This paper introduces ViGText, a novel approach that integrates images with Vision Large Language Model (VLLM) text explanations within a graph-based framework to improve deepfake detection. The novelty of ViGText lies in its integration of detailed explanations with visual data: it provides a more context-aware analysis than captions, which often lack specificity and fail to reveal subtle inconsistencies. ViGText systematically divides images into patches, constructs image and text graphs, and integrates them for analysis using Graph Neural Networks (GNNs) to identify deepfakes. Through multi-level feature extraction across the spatial and frequency domains, ViGText captures details that enhance its robustness and accuracy in detecting sophisticated deepfakes. Extensive experiments demonstrate that ViGText significantly enhances generalization and achieves a notable performance boost when detecting user-customized deepfakes. Specifically, average F1 scores rise from 72.45% to 98.32% under generalization evaluation, reflecting the model's superior ability to generalize to unseen, fine-tuned variations of stable diffusion models. As for robustness, ViGText achieves an 11.1% increase in recall compared to other deepfake detection approaches. When facing targeted attacks that exploit its graph-based architecture, ViGText limits classification performance degradation to less than 4%. ViGText uses detailed visual and textual analysis to set a new standard for detecting deepfakes, helping ensure media authenticity and information integrity.
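The abstract's pipeline of dividing an image into patches, building a graph over them, and analyzing it with a GNN can be sketched minimally as follows. This is an illustrative sketch only: the grid adjacency, raw-pixel features, and single mean-aggregation layer are assumptions for clarity, not the paper's actual graph construction or architecture (which also fuses text-explanation nodes).

```python
import numpy as np

def split_into_patches(image, patch_size):
    """Split an HxW grayscale image into non-overlapping square patches."""
    h, w = image.shape
    patches = []
    for i in range(0, h - patch_size + 1, patch_size):
        for j in range(0, w - patch_size + 1, patch_size):
            patches.append(image[i:i + patch_size, j:j + patch_size])
    return patches

def grid_adjacency(rows, cols):
    """4-neighbour adjacency over a rows x cols patch grid, with self-loops."""
    n = rows * cols
    adj = np.eye(n)
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                 # right neighbour
                adj[i, i + 1] = adj[i + 1, i] = 1
            if r + 1 < rows:                 # bottom neighbour
                adj[i, i + cols] = adj[i + cols, i] = 1
    return adj

def gnn_layer(features, adj, weight):
    """One mean-aggregation message-passing step: H' = relu(D^-1 A H W)."""
    deg = adj.sum(axis=1, keepdims=True)     # node degrees (incl. self-loop)
    return np.maximum((adj / deg) @ features @ weight, 0.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64))
patches = split_into_patches(image, 16)              # 4x4 grid -> 16 patches
feats = np.stack([p.reshape(-1) for p in patches])   # (16, 256) raw features
adj = grid_adjacency(4, 4)
out = gnn_layer(feats, adj, rng.standard_normal((256, 32)))
print(out.shape)  # (16, 32): one embedding per patch node
```

In the full method these node embeddings would be pooled and classified as real vs. fake; here the sketch stops at one message-passing round to show how patch neighbours exchange information.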
Problem

Research questions and friction points this paper is trying to address.

Detecting sophisticated deepfakes with improved generalization and robustness
Integrating vision-language explanations for context-aware deepfake analysis
Enhancing detection accuracy against customized deepfakes and malicious attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates Vision-Language Model with Graph Neural Networks
Uses multi-level spatial and frequency feature extraction
Divides images into patches for detailed analysis
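The dual-path feature extraction listed above can be illustrated with a minimal per-patch sketch. The specific features here (pooled intensity statistics for the spatial path, log-magnitude FFT spectrum for the frequency path) are illustrative assumptions, not the paper's exact feature set.

```python
import numpy as np

def patch_features(patch):
    """Concatenate a spatial-domain and a frequency-domain descriptor.

    Spatial path: simple pooled statistics of pixel intensities.
    Frequency path: log-magnitude of the centered 2D FFT, flattened.
    (Illustrative choices; ViGText's actual features may differ.)
    """
    spatial = np.array([patch.mean(), patch.std(), patch.min(), patch.max()])
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    freq = np.log1p(spectrum).ravel()        # log scale tames the DC peak
    return np.concatenate([spatial, freq])

rng = np.random.default_rng(1)
patch = rng.random((8, 8))
vec = patch_features(patch)
print(vec.shape)  # (68,): 4 spatial stats + 64 frequency bins
```

Frequency-domain descriptors are a common choice for deepfake detection because generative models often leave periodic upsampling artifacts that are faint in pixel space but prominent in the spectrum.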