🤖 AI Summary
This study addresses two critical gaps in AI-generated image (AIGI) evaluation: the absence of systematic benchmarks and a limited understanding of how well large multimodal models (LMMs) can discriminate AIGI quality. To this end, the authors introduce A-Bench—the first dedicated benchmark for AIGI assessment—comprising 2,864 images from 16 text-to-image models (e.g., Stable Diffusion, DALL·E, SDXL), each paired with high-quality, expert-annotated question–answer pairs, and evaluate 18 state-of-the-art LMMs (e.g., GPT-4V, Qwen-VL, LLaVA) on it. A-Bench jointly diagnoses two capabilities—high-level semantic comprehension and low-level visual quality perception—while covering a diverse range of both generative and evaluating models. Empirical analysis reveals pervasive semantic biases and blindness to fine-grained detail in current LMMs. All data and evaluation code are publicly released to advance standardized, reproducible AIGI assessment research.
📝 Abstract
How to accurately and efficiently assess AI-generated images (AIGIs) remains a critical challenge for generative models. Given the high cost and time required for user studies, many researchers have turned to large multi-modal models (LMMs) as AIGI evaluators, though their precision and validity remain questionable. Furthermore, traditional benchmarks mostly use naturally captured content rather than AIGIs to test the abilities of LMMs, leaving a noticeable gap for AIGIs. Therefore, we introduce A-Bench in this paper, a benchmark designed to diagnose whether LMMs are masters at evaluating AIGIs. Specifically, A-Bench is organized under two key principles: 1) emphasizing both high-level semantic understanding and low-level visual quality perception, to address the intricate demands of AIGIs; 2) employing various generative models for AIGI creation and various LMMs for evaluation, to ensure a comprehensive validation scope. Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question–answer pairs annotated by human experts, and tested across 18 leading LMMs. We hope that A-Bench will significantly enhance the evaluation process and promote the generation quality of AIGIs. The benchmark is available at https://github.com/Q-Future/A-Bench.
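The benchmark-style evaluation described above ultimately reduces to scoring an LMM's answers to multiple-choice questions against expert annotations, split by capability category (semantics vs. quality). A minimal sketch of such a scoring loop is shown below; note that the QA schema, field names, and function names here are illustrative assumptions, not A-Bench's actual data format or released evaluation code.

```python
from collections import defaultdict

def score_answers(qa_pairs, model_answer):
    """Compute per-category accuracy of an LMM on multiple-choice QA pairs.

    qa_pairs: list of dicts with hypothetical keys
              'question', 'choices', 'answer', 'category'.
    model_answer: callable (question, choices) -> chosen option string,
                  standing in for a real LMM query.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qa in qa_pairs:
        pred = model_answer(qa["question"], qa["choices"])
        total[qa["category"]] += 1
        if pred == qa["answer"]:
            correct[qa["category"]] += 1
    # Accuracy per capability category (e.g., semantics vs. quality).
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy example with a trivial "model" that always picks the first choice:
sample = [
    {"question": "Is the fur texture rendered realistically?",
     "choices": ["Yes", "No"], "answer": "Yes",
     "category": "low-level quality"},
    {"question": "Does the image match the prompt 'two cats on a sofa'?",
     "choices": ["Yes", "No"], "answer": "No",
     "category": "high-level semantics"},
]
first_choice = lambda question, choices: choices[0]
print(score_answers(sample, first_choice))
# → {'low-level quality': 1.0, 'high-level semantics': 0.0}
```

Reporting accuracy per category rather than one aggregate number is what lets this kind of benchmark separate semantic understanding from quality perception, the two axes A-Bench is organized around.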