🤖 AI Summary
Evaluating text-to-image (T2I) models remains challenging: assessing perceptual quality and text-image alignment is difficult, and human evaluation is costly and slow. Method: We introduce EvalMi-50K, a large-scale benchmark for multidimensional human preferences comprising 50,400 generated images, 100K mean opinion scores (MOSs), and 50K question-answer (QA) pairs. We further propose LMM4LMM, an automatic, end-to-end evaluation metric built on large multimodal models (LMMs) that eliminates manual sub-metric design and jointly assesses perceptual quality, text-image alignment, and task-specific accuracy via fine-grained task classification, structured multi-turn prompting, and human-preference modeling. Contribution/Results: On EvalMi-50K, LMM4LMM achieves state-of-the-art correlation with human judgments (ρ = 0.89) and generalizes better across datasets than CLIPScore, BLIPScore, and other baselines, consistently outperforming them on multiple external benchmarks.
📝 Abstract
Recent breakthroughs in large multimodal models (LMMs) have significantly advanced both text-to-image (T2I) generation and image-to-text (I2T) interpretation. However, many generated images still suffer from issues in perceptual quality and text-image alignment. Given the high cost and inefficiency of manual evaluation, an automatic metric that aligns with human preferences is desirable. To this end, we present EvalMi-50K, a comprehensive dataset and benchmark for evaluating large multimodal image generation, which features (i) comprehensive tasks, encompassing 2,100 extensive prompts across 20 fine-grained task dimensions, and (ii) large-scale human-preference annotations, including 100K mean opinion scores (MOSs) and 50K question-answering (QA) pairs annotated on 50,400 images generated by 24 T2I models. Based on EvalMi-50K, we propose LMM4LMM, an LMM-based metric for evaluating large multimodal T2I generation along multiple dimensions, including perception, text-image correspondence, and task-specific accuracy. Extensive experimental results show that LMM4LMM achieves state-of-the-art performance on EvalMi-50K and exhibits strong generalization to other AI-generated image evaluation benchmarks, demonstrating the generality of both the EvalMi-50K dataset and the LMM4LMM metric. Both EvalMi-50K and LMM4LMM will be released at https://github.com/IntMeGroup/LMM4LMM.
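Metrics like LMM4LMM are typically validated by correlating their scores with human MOS annotations. The sketch below illustrates that protocol with Spearman's rank correlation (SRCC), a standard choice in quality assessment; the score values are synthetic illustration data, not numbers from EvalMi-50K.

```python
# Sketch: validating an automatic T2I quality metric against human
# Mean Opinion Scores (MOS) via Spearman rank correlation (SRCC).
# All score values below are made-up illustration data.

def ranks(xs):
    """Return 1-based ranks of xs (assumes no tied values)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman's rho via the rank-difference formula (no ties)."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Human MOS and an automatic metric's scores for 8 generated images
mos    = [4.2, 3.1, 2.5, 4.8, 1.9, 3.7, 2.2, 4.0]
metric = [0.81, 0.62, 0.44, 0.93, 0.35, 0.74, 0.50, 0.78]

print(f"SRCC = {spearman(mos, metric):.3f}")  # rank agreement with humans
```

A higher SRCC means the metric orders images the same way human raters do; benchmarks such as EvalMi-50K report this kind of correlation per evaluation dimension.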