🤖 AI Summary
Current text-to-image (T2I) models are not robust when generating images from long, complex prompts involving multiple objects, attributes, and intricate spatial relations, and mainstream evaluation metrics such as CLIPScore fail to capture fine-grained semantic alignment. To address this, we introduce LongBench-T2I, the first comprehensive benchmark for complex instruction evaluation, comprising 500 carefully curated prompts spanning nine visual evaluation dimensions. We further propose Plan2Gen, a plug-and-play, zero-shot agent framework that uses large language models for prompt parsing, stepwise planning, and modular prompt decomposition, so it can be combined with arbitrary black-box T2I models without fine-tuning. We also release a multidimensional automated evaluation toolkit. Experiments show that LongBench-T2I exposes critical capability gaps in state-of-the-art models, and that Plan2Gen improves multi-constraint generation accuracy by over 37% without modifying the base models. All data, code, and evaluation tools are publicly released.
📝 Abstract
Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.
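The plan-then-generate idea described above can be illustrated with a minimal sketch. It is not the Plan2Gen implementation: every name here (`decompose`, `compose`, `generate`, `fake_llm`, `fake_t2i`) is hypothetical, the LLM and the T2I model are reduced to plain callables, and the stubs exist only so the example runs without external services.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interfaces, not taken from the LongBench-T2I repository.
LLM = Callable[[str], str]         # instruction text in, reply text out
T2IModel = Callable[[str], bytes]  # prompt in, encoded image bytes out


@dataclass
class Plan:
    steps: List[str]  # one short sub-prompt per element


def decompose(prompt: str, llm: LLM) -> Plan:
    """Ask the LLM to split a long prompt into short sub-prompts, one per line."""
    reply = llm("Split this image description into short parts, one per line:\n" + prompt)
    steps = [line.strip("- \t") for line in reply.splitlines()]
    return Plan(steps=[s for s in steps if s])


def compose(plan: Plan) -> str:
    """Merge the sub-prompts into a single structured prompt."""
    return "; ".join(plan.steps)


def generate(prompt: str, llm: LLM, t2i: T2IModel) -> bytes:
    """Plan with the LLM, then call the unmodified black-box T2I model once."""
    return t2i(compose(decompose(prompt, llm)))


def fake_llm(request: str) -> str:
    """Stub LLM (assumption): splits the description on commas."""
    description = request.split("\n", 1)[1]
    return "\n".join("- " + part.strip() for part in description.split(","))


def fake_t2i(prompt: str) -> bytes:
    """Stub T2I model (assumption): returns the prompt as bytes instead of an image."""
    return prompt.encode("utf-8")


if __name__ == "__main__":
    text = "a red fox on a rock, a blue lake behind it, snow on the peaks"
    print(decompose(text, fake_llm).steps)
    print(generate(text, fake_llm, fake_t2i))
```

Because the T2I model is only ever called through a prompt string, any generator can be swapped in without retraining, which is the property the abstract attributes to the framework.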