🤖 AI Summary
Text-to-image (T2I) models excel in visual fidelity but suffer from poor compositional generalization, particularly in modeling object relations, attribute binding, and fine-grained details, largely because they are never explicitly trained to discriminate between compositionally similar prompts or images. To address this, we propose a multi-tool LLM agent framework that autonomously constructs high-discriminability compositional contrastive datasets. We further introduce an agentic preference optimization method, an agent-driven preference tuning approach that jointly leverages image generation, editing, and visual question answering tools to improve end-to-end compositional reasoning without degrading visual quality. Evaluated on T2I-CompBench and related benchmarks, our approach achieves state-of-the-art performance. Notably, it also improves auxiliary capabilities such as text rendering, despite no explicit optimization for these tasks.
📝 Abstract
Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench, without compromising image quality (a common drawback in prior approaches), and even generalizes to capabilities it was not explicitly trained for, such as text rendering.
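The abstract does not specify the exact form of the agentic preference optimization objective. As a rough illustration only, preference tuning on a contrastive pair (a compositionally faithful "winning" sample vs. a near-miss "losing" sample) is often cast as a DPO-style loss over policy and reference log-likelihoods; the function name, the `beta` temperature, and the use of scalar log-likelihoods below are illustrative assumptions, not the paper's implementation:

```python
import math

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style preference loss on one contrastive pair.

    Pushes the policy to assign a higher likelihood margin to the
    compositionally faithful sample (w) over the near-miss (l),
    measured relative to a frozen reference model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form
    return math.log1p(math.exp(-margin))

# When the policy matches the reference, the margin is 0 and the
# loss sits at log(2); it shrinks as the policy widens the gap
# between the preferred and dispreferred samples.
loss_at_init = preference_loss(-5.0, -6.0, -5.0, -6.0)
```

In practice the log-likelihoods would come from the text-to-image model itself (e.g. per-sample diffusion losses), and the loss would be averaged over the agent-constructed contrastive dataset.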