Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics

📅 2025-04-16

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

Addressing key challenges in AI-generated image forensic analysis—namely, difficulty in identifying fine-grained forgeries, weak localization of tampered regions, and inability to trace generation methods—this paper proposes the first multimodal large language model (MLLM) framework tailored for fine-grained image forgery analysis. Our approach integrates semantic cue-driven multi-stage prompting, few-shot learning, cross-modal semantic alignment, and localized forgery cue modeling to unlock GPT-4V’s capabilities in four critical tasks: authenticity assessment, tamper localization, evidentiary explanation generation, and generative method attribution. Evaluated on Autosplice and LaMa benchmarks, our method achieves 92.1% and 86.3% detection accuracy, respectively—matching state-of-the-art specialized models. This work provides the first systematic empirical validation of general-purpose multimodal LLMs for AIGC forensic analysis, demonstrating their feasibility, effectiveness, and untapped potential in this domain.

Technology Category

Application Category

📝 Abstract

The rapid development of generative AI facilitates content creation and makes image manipulation easier and more difficult to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated Content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT4V can achieve an accuracy of 92.1% in Autosplice and 86.3% in LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.

Problem

Research questions and friction points this paper is trying to address.

Detecting AI-generated images using multimodal LLMs

Localizing tampered regions and tracing generation methods

Improving forgery analysis accuracy with prompt engineering

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal LLMs for forgery detection

Prompt engineering and few-shot learning

Localize tampered regions with semantic clues

🔎 Similar Papers

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models