CompAlign: Improving Compositional Text-to-Image Generation with a Complex Benchmark and Fine-Grained Feedback

📅 2025-05-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current text-to-image (T2I) models excel at high-resolution generation but struggle with compositional alignment, particularly in scenarios involving multiple objects, diverse attributes, and precise 3D spatial relationships. Method: We introduce (1) CompAlign, the first benchmark explicitly targeting numerically grounded 3D spatial constraints, comprising 900 multi-object prompts; (2) CompQuest, an interpretable, fine-grained evaluation framework that decomposes composite prompts into atomic sub-questions and leverages multimodal large language models (MLLMs) for binary verification; and (3) a preference-driven diffusion alignment method for T2I that enables per-image tunable preference optimization. Results: Experiments across nine mainstream T2I models reveal substantial performance degradation on complex 3D configuration tasks, with open-source models significantly underperforming closed-source counterparts. After alignment, compositional accuracy improves markedly, surpassing state-of-the-art methods. The benchmark and code are publicly released.

📝 Abstract
State-of-the-art T2I models are capable of generating high-resolution images given textual prompts. However, they still struggle to accurately depict compositional scenes that specify multiple objects, attributes, and spatial relations. We present CompAlign, a challenging benchmark for evaluating and improving compositional image generation, with an emphasis on assessing the depiction of 3D-spatial relationships. CompAlign consists of 900 complex multi-subject image generation prompts that combine numerical and 3D-spatial relationships with varied attribute bindings. The benchmark is notably challenging, incorporating generation tasks with three or more subjects in complex 3D-spatial configurations. Additionally, we propose CompQuest, an interpretable and accurate evaluation framework that decomposes complex prompts into atomic sub-questions, then uses an MLLM to provide fine-grained binary feedback on the correctness of each generated element in model-generated images. This enables precise quantification of the alignment between generated images and compositional prompts. Furthermore, we propose an alignment framework that uses CompQuest's feedback as preference signals to improve diffusion models' compositional image generation abilities. With adjustable per-image preferences, our method scales easily and adapts flexibly to different tasks. Evaluation of 9 T2I models reveals that: (1) models struggle markedly more with compositional tasks involving complex 3D-spatial configurations, and (2) a noticeable performance gap exists between open-source models and closed-source commercial models. A further empirical study on using CompAlign for model alignment yields promising results: post-alignment diffusion models achieve substantial improvements in compositional accuracy, especially on complex generation tasks, outperforming previous approaches.
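The CompQuest idea described above (decompose a composite prompt into atomic yes/no sub-questions, ask an MLLM each one, and aggregate the binary answers) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's released code: `ask_mllm` is a hypothetical stand-in for a real multimodal-model call, and the element categories are assumed examples.

```python
from dataclasses import dataclass

@dataclass
class SubQuestion:
    text: str      # one atomic yes/no question about a single prompt element
    category: str  # e.g. "count", "attribute", "3d_relation" (assumed labels)

def decompose(prompt_elements):
    """Turn structured prompt elements into atomic binary sub-questions."""
    return [SubQuestion(f"Does the image show {desc}?", kind)
            for kind, desc in prompt_elements]

def compquest_score(questions, ask_mllm):
    """Fraction of sub-questions the MLLM answers 'yes' for one generated image."""
    answers = [ask_mllm(q.text) for q in questions]
    return sum(answers) / len(answers)

# Example with a mocked MLLM verdict per question (a real system would
# send the image plus each question to a multimodal model).
elements = [
    ("count", "exactly two cats"),
    ("attribute", "a red ball"),
    ("3d_relation", "the ball in front of the cats"),
]
qs = decompose(elements)
mock_verdicts = {q.text: v for q, v in zip(qs, [True, True, False])}
score = compquest_score(qs, lambda t: mock_verdicts[t])
print(score)  # 2 of 3 atomic checks pass
```

The per-question binary answers are what make the score interpretable: a failure can be traced to the specific element (count, attribute, or 3D relation) that the model got wrong.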
Problem

Research questions and friction points this paper is trying to address.

Improving accuracy in multi-object, attribute-rich compositional image generation
Evaluating 3D-spatial relationship depiction in text-to-image models
Enhancing model alignment with fine-grained feedback for complex prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces CompAlign benchmark for compositional T2I evaluation
Proposes CompQuest framework for fine-grained feedback using MLLM
Uses feedback as preference signals to align diffusion models
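The third contribution, turning per-image CompQuest scores into preference signals, might look roughly like the sketch below. This is a hedged illustration under assumptions, not the paper's exact recipe: the pairing rule (best vs. worst image per prompt) and the `margin` field used as a per-image tunable weight are illustrative choices for a DPO-style alignment pipeline.

```python
def build_preference_pairs(scored_images):
    """scored_images: {prompt: [(image_id, compquest_score), ...]}.
    For each prompt, pair the best-scoring image (preferred) with the
    worst-scoring one (rejected); skip prompts with no score gap."""
    pairs = []
    for prompt, items in scored_images.items():
        ranked = sorted(items, key=lambda x: x[1], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best[1] > worst[1]:
            pairs.append({
                "prompt": prompt,
                "preferred": best[0],
                "rejected": worst[0],
                # score gap, usable as an adjustable per-image preference weight
                "margin": best[1] - worst[1],
            })
    return pairs

# Three candidate generations for one prompt, scored by atomic-check accuracy.
data = {"two cats and a red ball": [("img_a", 1.0), ("img_b", 2/3), ("img_c", 1/3)]}
pairs = build_preference_pairs(data)
print(pairs[0]["preferred"], round(pairs[0]["margin"], 2))  # → img_a 0.67
```

The resulting pairs could then feed a standard preference-optimization objective for the diffusion model; scaling the loss by `margin` is one way to realize the "adjustable per-image preferences" the summary mentions.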