Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation

📅 2025-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-image (T2I) models are not robust when generating images from long, complex prompts that involve multiple objects, attributes, and intricate spatial relations. Mainstream evaluation metrics such as CLIPScore also fail to capture fine-grained semantic alignment. To address this, we introduce LongBench-T2I, the first comprehensive benchmark for complex instruction evaluation, comprising 500 curated prompts that span nine visual evaluation dimensions. We further propose Plan2Gen, a plug-and-play, zero-shot agent framework that uses large language models for prompt parsing, stepwise planning, and modular prompt decomposition, so it can be combined with arbitrary black-box T2I models without fine-tuning. We also release a multidimensional automated evaluation toolkit. Experiments show that LongBench-T2I exposes critical capability gaps in state-of-the-art models, and that Plan2Gen improves multi-constraint generation accuracy by over 37% without modifying the base models. All data, code, and evaluation tools are publicly released.

📝 Abstract

Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.
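
The plan-then-generate idea described in the abstract can be illustrated with a minimal sketch. This is not the Plan2Gen implementation: the names `PlanStep`, `naive_decompose`, `build_plan`, `run_plan`, and `fake_t2i` are hypothetical, the sentence splitter merely stands in for an LLM call, and feeding cumulative sub-prompts to the generator is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PlanStep:
    """One sub-prompt in a generation plan (hypothetical structure)."""
    index: int
    sub_prompt: str


def naive_decompose(prompt: str) -> List[str]:
    """Stand-in for the LLM call: split a long prompt into sentence-level parts."""
    return [part.strip() for part in prompt.split(".") if part.strip()]


def build_plan(
    prompt: str,
    decompose: Callable[[str], List[str]] = naive_decompose,
) -> List[PlanStep]:
    """Turn a long instruction into an ordered list of sub-prompts."""
    return [PlanStep(i, text) for i, text in enumerate(decompose(prompt))]


def run_plan(plan: List[PlanStep], generate: Callable[[str], str]) -> str:
    """Feed cumulative sub-prompts to a black-box T2I callable, one step at a time.

    `generate` maps a prompt to an image handle; here the handle is a plain
    string so the sketch runs without a real model (assumption).
    """
    context: List[str] = []
    image = ""
    for step in plan:
        context.append(step.sub_prompt)
        image = generate(". ".join(context))
    return image


def fake_t2i(prompt: str) -> str:
    """Placeholder for a real T2I model; returns a tag instead of pixels."""
    return f"<image for: {prompt}>"


plan = build_plan(
    "A red fox sits on a mossy log. A blue lantern hangs above it. "
    "Snow falls in the background."
)
final_image = run_plan(plan, fake_t2i)
```

Because the generator is passed in as a plain callable, the base model is treated as a black box, which mirrors the training-free integration the abstract describes; swapping `naive_decompose` for an actual LLM-backed function would not change the rest of the loop.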

Problem

Research questions and friction points this paper is trying to address.

Evaluate T2I models' ability to follow complex multi-object instructions

Develop a benchmark for nuanced assessment of complex prompt adherence

Propose an agent framework for complex instruction-driven image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces LongBench-T2I benchmark for complex instructions

Proposes Plan2Gen agent framework without extra training

Develops automated evaluation toolkit for multi-dimensional metrics
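
As a rough illustration of what multi-dimensional scoring involves, the sketch below averages per-image scores along each dimension and then reports an unweighted overall mean. The dimension names, the score scale, and the `aggregate` helper are placeholders assumed for this example; they are not taken from the released toolkit.

```python
from statistics import mean
from typing import Dict, List

# Placeholder dimension names (assumption); the benchmark defines nine
# dimensions, which are not enumerated on this page.
DIMENSIONS = ["dim_1", "dim_2", "dim_3"]


def aggregate(scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-image scores per dimension, plus an unweighted overall mean."""
    report = {d: mean(s[d] for s in scores) for d in DIMENSIONS}
    report["overall"] = mean(report[d] for d in DIMENSIONS)
    return report


# Two hypothetical images, each scored on every dimension.
report = aggregate([
    {"dim_1": 4.0, "dim_2": 1.0, "dim_3": 5.0},
    {"dim_1": 2.0, "dim_2": 3.0, "dim_3": 5.0},
])
```

Keeping the per-dimension averages alongside the overall value makes it possible to see where a model falls short, which a single scalar such as CLIPScore cannot show.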

🔎 Similar Papers

DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation