AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

📅 2026-04-30

🤖 AI Summary
Existing methods struggle to detect manipulations in AI-generated academic images, and comprehensive evaluation frameworks tailored to scholarly imagery are lacking. This work introduces AEGIS, a multidimensional forensic benchmark designed specifically for academic images. It covers seven academic categories with 39 fine-grained subtypes and four prevalent forgery strategies, and it jointly assesses detection, localization, and reasoning. The evaluation spans 25 generative models, 25 multimodal large language models (MLLMs), nine expert detectors, and one unified multimodal understanding and generation model, and it reveals a pronounced gap between generation and forensic capabilities: 11 of the 25 generative models yield average forensic accuracy below 50%. GPT-5.1 reaches only 48.80% overall performance, and expert models reach a localization IoU of only 30.09%. The two model families show complementary strengths: MLLMs reach 84.74% accuracy in textual artifact recognition, whereas expert detectors peak at 79.54% accuracy in binary authenticity detection.
📝 Abstract
We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
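The abstract reports two kinds of scores: IoU for localization and accuracy for detection. The following is a minimal sketch of how such metrics are conventionally computed, not the benchmark's own evaluation code. The function names `mask_iou` and `accuracy`, the nested-list mask format, and the `"real"`/`"fake"` labels are assumptions made for illustration; the source does not describe AEGIS's exact scoring protocol.

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks.

    Masks are equally sized nested lists of 0/1 (an assumed format);
    1 marks a pixel flagged as forged.
    """
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy inputs, not data from the benchmark.
pred_mask = [[1, 1, 0],
             [0, 1, 0]]
true_mask = [[1, 0, 0],
             [0, 1, 1]]
iou = mask_iou(pred_mask, true_mask)  # 2 shared pixels / 4 flagged pixels

acc = accuracy(["fake", "real", "fake", "fake"],
               ["fake", "real", "real", "fake"])  # 3 of 4 correct
```

Under this convention, an IoU of 30.09% means that, on average, less than a third of the combined predicted and true forged area overlaps.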
Problem

Research questions and friction points this paper is trying to address.

AI-generated academic images
forensic analysis
benchmark
image forgery
multimodal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

AI-generated academic images
forensic benchmark
domain-specific complexity
forgery simulation
multimodal evaluation