AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

📅 2025-07-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the poor interpretability and limited generalization of AI-generated image (AIGI) detection, this work introduces Holmes-Set, a large-scale dataset with explanation annotations, and proposes the Holmes Pipeline, a three-stage training framework. The pipeline adapts multimodal large language models (MLLMs) through visual expert pre-training, supervised fine-tuning, and direct preference optimization (DPO). Its training data is produced with the Multi-Expert Jury annotation protocol, which combines structured MLLM explanations, cross-model quality control, expert defect filtering, and human preference modification. At inference time, a collaborative decoding strategy combines the visual expert's perception with the MLLM's semantic reasoning. Evaluated on three benchmarks, AIGI-Holmes is reported to improve both detection accuracy and explanation quality, with better generalization to newer generative models.

📝 Abstract

The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) in spreading misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) limited generalization to the latest generation technologies. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes the Holmes-SFTSet, an instruction-tuning dataset with explanations on whether images are AI-generated, and the Holmes-DPOSet, a human-aligned preference dataset. Our work introduces an efficient data annotation method called the Multi-Expert Jury, enhancing data generation through structured MLLM explanations and quality control via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose the Holmes Pipeline, a meticulously designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. The Holmes Pipeline adapts multimodal large language models (MLLMs) for AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model AIGI-Holmes. During the inference stage, we introduce a collaborative decoding strategy that integrates the model perception of the visual expert with the semantic reasoning of MLLMs, further enhancing the generalization capabilities. Extensive experiments on three benchmarks validate the effectiveness of AIGI-Holmes.
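
The abstract does not give the formula behind collaborative decoding, so the following is only an illustrative sketch of one way a visual expert's score could be combined with an MLLM's verdict. It assumes the MLLM exposes logits for two verdict tokens and the expert outputs a probability that the image is fake. The function name `fuse_verdict`, the weight `alpha`, and the linear mixing rule are all hypothetical and are not taken from the paper.

```python
import math


def fuse_verdict(mllm_logits, expert_p_fake, alpha=0.5):
    """Mix MLLM verdict probabilities with a visual expert's probability.

    mllm_logits   -- dict with raw logits for the 'real' and 'fake' tokens
    expert_p_fake -- expert's probability that the image is AI-generated
    alpha         -- hypothetical weight given to the expert (0 = MLLM only)
    """
    # Softmax over the two verdict tokens, shifted by the max for stability.
    m = max(mllm_logits.values())
    exps = {k: math.exp(v - m) for k, v in mllm_logits.items()}
    z = sum(exps.values())
    mllm_p = {k: e / z for k, e in exps.items()}

    expert_p = {"fake": expert_p_fake, "real": 1.0 - expert_p_fake}

    # Assumed fusion rule: convex combination of the two distributions.
    fused = {k: (1.0 - alpha) * mllm_p[k] + alpha * expert_p[k] for k in mllm_p}
    verdict = max(fused, key=fused.get)
    return verdict, fused


if __name__ == "__main__":
    # An undecided MLLM (equal logits) paired with a confident expert.
    print(fuse_verdict({"real": 0.0, "fake": 0.0}, expert_p_fake=0.9))
```

With `alpha=0` the verdict depends on the MLLM alone, and with `alpha=1` on the expert alone; intermediate values let a confident expert tip an undecided MLLM, which is the general intuition the abstract describes.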

Problem

Research questions and friction points this paper is trying to address.

Detect AI-generated images with explainable results

Improve generalization to images from the latest generation technologies

Address misuse of realistic AI images in misinformation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Large Language Models for detection

Multi-Expert Jury data annotation method

Three-stage training framework Holmes Pipeline
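
The third stage of the Holmes Pipeline is direct preference optimization. As a minimal sketch of the standard DPO objective for a single preference pair (not the authors' training code), the loss can be written in plain Python. The inputs are assumed to be summed log-probabilities of the preferred and rejected explanations under the trained policy and a frozen reference model; the function names and the default `beta=0.1` are illustrative choices.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the total log-probability of a response under the
    policy or the frozen reference model. beta scales the implicit reward.
    """
    # Implicit rewards: how far the policy has moved from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) equals softplus(-margin).
    return softplus(-margin)


if __name__ == "__main__":
    # The policy favors the chosen explanation more than the reference does,
    # so the loss falls below log(2), its value when there is no preference.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; the loss shrinks as the policy assigns relatively more probability to the preferred explanation.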

🔎 Similar Papers

FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models