UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation

📅 2025-10-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing T2I evaluation benchmarks suffer from narrow prompt scenarios, a lack of multilingual support, and coarse-grained semantic coverage, hindering fine-grained semantic consistency assessment. To address these limitations, we propose the first unified cross-lingual T2I semantic evaluation benchmark, encompassing five major themes, twenty subthemes, and 600 bilingual (Chinese–English) short and long prompts. We design a hierarchical prompt schema and establish a multi-granularity evaluation framework with ten primary dimensions and twenty-seven subdimensions, enabling fine-grained semantic evaluation under variable-length prompts and multilingual settings for the first time. Leveraging Gemini-2.5-Pro, we build an automated evaluation pipeline and open-source a lightweight, offline-executable assessment model. Comprehensive evaluation of leading T2I models reveals their capability boundaries across semantic dimensions, thereby advancing standardization and rigor in T2I evaluation.
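
As a concrete illustration of the hierarchical schema described above, one might lay each benchmark entry out as follows. The field names and structure here are assumptions for exposition, not the paper's released data format:

```python
from dataclasses import dataclass, field

@dataclass
class Testpoint:
    """One semantic check a prompt probes, tagged by evaluation dimension."""
    primary_dimension: str  # one of the 10 primary dimensions, e.g. "attribute"
    sub_dimension: str      # one of the 27 subdimensions, e.g. "color"
    description: str        # what a judge should verify in the generated image

@dataclass
class BenchmarkPrompt:
    """One bilingual benchmark entry in the hierarchical schema (hypothetical layout)."""
    theme: str     # one of the 5 main prompt themes
    subtheme: str  # one of the 20 subthemes
    prompt_en_short: str
    prompt_en_long: str
    prompt_zh_short: str
    prompt_zh_long: str
    testpoints: list[Testpoint] = field(default_factory=list)
```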

📝 Abstract
Recent progress in text-to-image (T2I) generation underscores the importance of reliable benchmarks for evaluating how accurately generated images reflect the semantics of their textual prompts. However, (1) existing benchmarks lack diversity in prompt scenarios and multilingual support, both essential for real-world applicability; and (2) they offer only coarse evaluations across primary dimensions, covering a narrow range of sub-dimensions and falling short in fine-grained assessment. To address these limitations, we introduce UniGenBench++, a unified semantic assessment benchmark for T2I generation. It comprises 600 prompts organized hierarchically to ensure both coverage and efficiency: (1) they span diverse real-world scenarios, i.e., 5 main prompt themes and 20 subthemes; and (2) they comprehensively probe T2I models' semantic consistency over 10 primary and 27 sub-dimension evaluation criteria, with each prompt assessing multiple testpoints. To rigorously assess model robustness to variations in language and prompt length, we provide English and Chinese versions of each prompt in both short and long forms. Leveraging the general world knowledge and fine-grained image understanding of a closed-source multimodal large language model (MLLM), Gemini-2.5-Pro, we develop an effective pipeline for reliable benchmark construction and streamlined model assessment. To further facilitate community use, we also train a robust evaluation model that enables offline assessment of T2I model outputs. Through comprehensive benchmarking of both open- and closed-source T2I models, we systematically reveal their strengths and weaknesses across various aspects.
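
At its core, the judging step the abstract describes reduces to asking the MLLM one verifiable question per testpoint. A minimal sketch of that loop, reusing the hypothetical schema sketched earlier; `ask_mllm` stands in for a call to a multimodal judge such as Gemini-2.5-Pro and is not the paper's actual interface or prompting protocol:

```python
def judge_image(image_path: str, prompt: BenchmarkPrompt, ask_mllm) -> list:
    """Collect one yes/no verdict per testpoint for a generated image.

    ask_mllm(image_path, question) -> str is a hypothetical wrapper around
    a multimodal judge model; the real prompting protocol may differ.
    """
    verdicts = []
    for tp in prompt.testpoints:
        question = (
            "Does the image satisfy the following requirement? "
            f"Answer yes or no.\nRequirement: {tp.description}"
        )
        answer = ask_mllm(image_path, question)
        # Treat any answer beginning with "yes" as a pass for this testpoint.
        verdicts.append((tp, answer.strip().lower().startswith("yes")))
    return verdicts
```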
Problem

Research questions and friction points this paper is trying to address.

Evaluating semantic accuracy in text-to-image generation systems
Addressing the lack of multilingual support and scenario diversity in existing benchmarks
Providing fine-grained assessment across multiple evaluation dimensions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical organization of 600 prompts spanning diverse real-world scenarios
Multilingual evaluation using English and Chinese prompt versions
MLLM-powered pipeline for automated semantic consistency assessment (per-dimension score aggregation sketched below)
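
One plausible way to roll per-testpoint verdicts up into the per-dimension scores such a pipeline reports. This aggregation rule (a simple pass rate per sub-dimension) is an assumption, not the paper's documented scoring formula:

```python
from collections import defaultdict

def dimension_scores(all_verdicts):
    """Roll (Testpoint, passed) pairs up into per-sub-dimension pass rates.

    all_verdicts: iterable of (testpoint, bool) pairs pooled over every
    image a given T2I model generated for the benchmark.
    """
    passed, total = defaultdict(int), defaultdict(int)
    for tp, ok in all_verdicts:
        key = (tp.primary_dimension, tp.sub_dimension)
        total[key] += 1
        passed[key] += int(ok)
    # Pass rate per (primary, sub) pair.
    return {key: passed[key] / total[key] for key in total}
```

Keying on (primary, sub-dimension) pairs keeps each sub-dimension score attributable to its parent dimension, so primary-dimension scores can be derived by a further average if desired.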