LongT2IBench: A Benchmark for Evaluating Long Text-to-Image Generation with Graph-structured Annotations

📅 2025-12-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) evaluation benchmarks are constrained by short prompts and coarse-grained subjective scoring (e.g., Mean Opinion Score), rendering them inadequate for fine-grained, interpretable alignment assessment of long-text-driven image generation. Method: We introduce LongT2IBench, the first structured benchmark tailored to long-text T2I evaluation, comprising 14K long-text–image pairs annotated with graph-structured human labels grounded in entities, attributes, and relations. We propose a novel "Generate-Refine-Qualify" graph annotation protocol and a Hierarchical Alignment Chain-of-Thought (CoT), integrating multimodal large language models (MLLMs), instruction tuning, and alignment quantification to enable interpretable, fine-grained cross-modal matching. Contribution/Results: We publicly release both the LongT2IBench dataset and the LongT2IExpert evaluation model, which significantly outperforms state-of-the-art methods in alignment scoring accuracy and structured explanation generation.

📝 Abstract
The increasing popularity of long Text-to-Image (T2I) generation has created an urgent need for automatic and interpretable models that can evaluate image-text alignment in long-prompt scenarios. However, existing T2I alignment benchmarks predominantly focus on short prompts and only provide MOS or Likert-scale annotations. This inherent limitation hinders the development of long T2I evaluators, particularly in terms of the interpretability of alignment. In this study, we contribute LongT2IBench, which comprises 14K long text-image pairs accompanied by graph-structured human annotations. Given the detail-intensive nature of long prompts, we first design a Generate-Refine-Qualify annotation protocol to convert them into textual graph structures that encompass entities, attributes, and relations. Through this transformation, fine-grained alignment annotations are achieved based on these granular elements. Finally, the graph-structured annotations are converted into alignment scores and interpretations to facilitate the design of T2I evaluation models. Based on LongT2IBench, we further propose LongT2IExpert, a long T2I evaluator that enables multi-modal large language models (MLLMs) to provide both quantitative scores and structured interpretations through an instruction-tuning process with Hierarchical Alignment Chain-of-Thought (CoT). Extensive experiments and comparisons demonstrate the superiority of the proposed LongT2IExpert in alignment evaluation and interpretation. Data and code have been released at https://welldky.github.io/LongT2IBench-Homepage/.
Problem

Research questions and friction points this paper is trying to address.

Addresses lack of benchmarks for long text-to-image alignment evaluation
Introduces graph-structured annotations for interpretable fine-grained assessment
Proposes a model providing quantitative scores and structured interpretations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-structured annotations for fine-grained image-text alignment
Hierarchical Alignment Chain-of-Thought for interpretable evaluation
Instruction-tuning MLLMs to provide scores and structured interpretations
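The core idea above can be illustrated with a minimal sketch: a long prompt is decomposed into entities, attributes, and relations; each element is checked against the generated image; and the per-element verdicts are aggregated into an alignment score plus a structured list of mismatches. All class and field names, the example prompt elements, and the uniform-average scoring are illustrative assumptions, not the paper's actual data format or weighting.

```python
# Hypothetical sketch of graph-structured alignment annotation (assumed
# schema, not the LongT2IBench format): each graph element maps to a
# boolean verdict of whether the image matches that element.
from dataclasses import dataclass, field

@dataclass
class GraphAnnotation:
    entities: dict[str, bool] = field(default_factory=dict)
    attributes: dict[str, bool] = field(default_factory=dict)
    relations: dict[str, bool] = field(default_factory=dict)

    def alignment_score(self) -> float:
        """Fraction of graph elements judged as matched (0.0-1.0).
        Uniform averaging is an assumption for illustration."""
        checks = [*self.entities.values(),
                  *self.attributes.values(),
                  *self.relations.values()]
        return sum(checks) / len(checks) if checks else 0.0

    def mismatches(self) -> list[str]:
        """Structured interpretation: which elements failed to align."""
        return [name
                for group in (self.entities, self.attributes, self.relations)
                for name, ok in group.items() if not ok]

# Toy example: 4 of 5 graph elements match the image.
ann = GraphAnnotation(
    entities={"a red fox": True, "an oak tree": True},
    attributes={"fox: sitting": True, "tree: autumn leaves": False},
    relations={"fox beneath tree": True},
)
print(ann.alignment_score())  # 0.8
print(ann.mismatches())       # ['tree: autumn leaves']
```

The split into a numeric score and an explicit mismatch list mirrors the paper's goal of pairing quantitative scores with structured, interpretable explanations.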
Zhichao Yang
School of Artificial Intelligence, Xidian University
Tianjiao Gu
School of Artificial Intelligence, Xidian University
Jianjie Wang
School of Artificial Intelligence, Xidian University
Feiyu Lin
School of Artificial Intelligence, Xidian University
Xiangfei Sheng
School of Artificial Intelligence, Xidian University
Pengfei Chen
School of Artificial Intelligence, Xidian University
Leida Li
Xidian University, China
Visual quality evaluation · Computational aesthetics · Affective computing