🤖 AI Summary
Current text-to-image models struggle to align generated images with complex textual prompts involving multiple instances, diverse categories, and intricate semantic relationships; moreover, fine-grained evaluation benchmarks that correlate strongly with human judgment remain scarce. To address these gaps, the authors introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation image-text alignment benchmark, together with $AlignScore$, an object-detection-based automatic metric that assesses both visual entities and their relational structure and aligns closely with human evaluation. Experiments show that current open-source diffusion models perform poorly on M$^3$T2IBench, confirming its difficulty. The authors further propose Revise-Then-Enforce, a training-free post-editing method that improves image-text alignment across a broad range of diffusion models.
📝 Abstract
Text-to-image models are known to struggle with generating images that align perfectly with textual prompts. Several previous studies have evaluated image-text alignment in text-to-image generation. However, these evaluations either address overly simple scenarios, in particular overlooking the difficulty of prompts containing multiple distinct instances of the same category, or they introduce metrics that do not correlate well with human evaluation. In this study, we introduce M$^3$T2IBench, a large-scale, multi-category, multi-instance, multi-relation benchmark, along with an object-detection-based evaluation metric, $AlignScore$, which aligns closely with human evaluation. Our findings reveal that current open-source text-to-image models perform poorly on this challenging benchmark. Additionally, we propose the Revise-Then-Enforce approach to enhance image-text alignment. This training-free post-editing method demonstrates improvements in image-text alignment across a broad range of diffusion models. \footnote{Our code and data have been released as supplementary material and will be made publicly available after the paper is accepted.}