AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods struggle to detect manipulations in AI-generated academic images, and no comprehensive evaluation framework has been tailored to the complexities of scholarly contexts. This work introduces AEGIS, presented as the first multidimensional forensic benchmark designed specifically for academic imagery. It covers seven major categories with 39 subtypes and four prevalent forgery strategies, enabling systematic assessment of detection, localization, and reasoning. Drawing on 25 generative models, 25 multimodal large language models (MLLMs), nine expert detectors, and one unified multimodal understanding and generation model, the study reveals a pronounced gap between generation and forensic technologies: 11 of the 25 generative models yield average forensic accuracy below 50%. GPT-5.1 reaches only 48.80% overall performance, and expert models attain a localization IoU of only 30.09%. The two model families show complementary strengths: MLLMs reach 84.74% accuracy in textual artifact recognition, whereas expert detectors peak at 79.54% accuracy in binary authenticity detection.

📝 Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
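The abstract reports localization quality as IoU and detection quality as accuracy. The minimal sketch below shows how these two metrics are conventionally computed. It assumes the standard definitions (intersection over union of binary masks, and the fraction of correct labels); the abstract does not spell out the exact protocol AEGIS uses. The function names `mask_iou` and `binary_accuracy`, the nested-list mask format, and the empty-mask convention are illustrative assumptions, not part of the benchmark.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks.

    Masks are equal-shaped nested lists of 0/1 values (an assumed format,
    chosen only to keep the sketch dependency-free).
    """
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            p, t = bool(p), bool(t)
            intersection += p and t  # pixel flagged in both masks
            union += p or t          # pixel flagged in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def binary_accuracy(pred_labels, true_labels):
    """Fraction of images whose predicted real/fake label matches the truth."""
    if len(pred_labels) != len(true_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    return correct / len(true_labels)


# Toy example: the prediction overlaps the forged region on 2 pixels,
# and 4 pixels are flagged by at least one of the two masks.
predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
ground_truth = [[0, 1, 0],
                [0, 1, 1]]
iou = mask_iou(predicted_mask, ground_truth)

# Toy example: 3 of 4 authenticity labels are correct.
acc = binary_accuracy(["fake", "real", "fake", "real"],
                      ["fake", "fake", "fake", "real"])
```

In this toy case `iou` is 0.5 and `acc` is 0.75. Under these assumed definitions, the reported 30.09% IoU would mean that predicted and true forged regions share less than a third of their combined area on average.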

Problem

Research questions and friction points this paper is trying to address.

AI-generated academic images

forensic analysis

benchmark

image forgery

multimodal evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-generated academic images

forensic benchmark

domain-specific complexity