🤖 AI Summary
To address the poor interpretability and limited generalization of AI-generated image (AIGI) detection, this work introduces Holmes-Set, a large-scale benchmark with fine-grained explanation annotations, and the Holmes Pipeline, a three-stage training framework. Holmes-Set is built with the Multi-Expert Jury annotation protocol, which combines structured explanations from multimodal large language models (MLLMs) with cross-model quality control and human preference alignment. The pipeline adapts an MLLM to AIGI detection through visual expert pre-training, supervised fine-tuning, and direct preference optimization (DPO), yielding the AIGI-Holmes model. Evaluated on three benchmarks, AIGI-Holmes is reported to improve both detection accuracy and explanation quality, and to generalize to unseen generative models.
📝 Abstract
The rapid development of AI-generated content (AIGC) technology has led to the misuse of highly realistic AI-generated images (AIGI) to spread misinformation, posing a threat to public information security. Although existing AIGI detection techniques are generally effective, they face two issues: 1) a lack of human-verifiable explanations, and 2) limited generalization to the latest generation technologies. To address these issues, we introduce a large-scale and comprehensive dataset, Holmes-Set, which includes Holmes-SFTSet, an instruction-tuning dataset with explanations of whether images are AI-generated, and Holmes-DPOSet, a human-aligned preference dataset. We also introduce an efficient data annotation method, the Multi-Expert Jury, which generates data through structured MLLM explanations and controls quality via cross-model evaluation, expert defect filtering, and human preference modification. In addition, we propose the Holmes Pipeline, a carefully designed three-stage training framework comprising visual expert pre-training, supervised fine-tuning, and direct preference optimization. The Holmes Pipeline adapts multimodal large language models (MLLMs) to AIGI detection while generating human-verifiable and human-aligned explanations, ultimately yielding our model, AIGI-Holmes. At inference time, we introduce a collaborative decoding strategy that integrates the visual expert's perception with the semantic reasoning of the MLLM, further improving generalization. Extensive experiments on three benchmarks validate the effectiveness of AIGI-Holmes.
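For readers unfamiliar with the third training stage, the following is a minimal sketch of the standard direct preference optimization objective for a single preference pair. It is not taken from the paper: the function name `dpo_loss`, the `beta` value, and the example log-probabilities are illustrative assumptions, and the abstract does not say which DPO variant or hyperparameters AIGI-Holmes uses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) explanation pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities for a preferred and a rejected explanation.
loss = dpo_loss(policy_chosen_logp=-8.0, policy_rejected_logp=-14.0,
                ref_chosen_logp=-10.0, ref_rejected_logp=-12.0)
print(f"DPO loss: {loss:.4f}")
```

The loss equals log 2 when the policy and reference agree, and it falls as the policy assigns relatively more probability to the preferred explanation than the reference does. In a setup like the one described above, the pairs would presumably come from a preference dataset such as Holmes-DPOSet.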