AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

📅 2025-12-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Text-to-image (T2I) models excel in visual fidelity but suffer from poor compositional generalization, particularly in modeling object relations, attribute binding, and fine-grained details, largely because they receive no explicit discriminative training on compositionally similar prompts or images. To address this, we propose a multi-tool LLM agent framework that autonomously constructs high-discriminability compositional contrastive datasets. We further introduce Agentic Preference Optimization (APO), an agent-driven preference tuning method that jointly leverages image generation, editing, and visual question answering tools, integrated with reward modeling for end-to-end compositional reasoning, without degrading visual quality. Evaluated on T2I-CompBench and related benchmarks, our approach achieves state-of-the-art performance. Notably, it also yields unexpected improvements in auxiliary capabilities such as text rendering, despite no explicit optimization for these tasks.

📝 Abstract
Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that models are not explicitly trained to differentiate between compositionally similar prompts and images, resulting in outputs that are close to the intended description yet deviate in fine-grained details. To address this, we propose AgentComp, a framework that explicitly trains models to better differentiate such compositional variations and enhance their reasoning ability. AgentComp leverages the reasoning and tool-use capabilities of large language models equipped with image generation, editing, and VQA tools to autonomously construct compositional datasets. Using these datasets, we apply an agentic preference optimization method to fine-tune text-to-image models, enabling them to better distinguish between compositionally similar samples and resulting in overall stronger compositional generation ability. AgentComp achieves state-of-the-art results on compositionality benchmarks such as T2I-CompBench, without compromising image quality (a common drawback in prior approaches) and even generalizes to other capabilities not explicitly trained for, such as text rendering.
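The agentic dataset-construction loop described in the abstract can be sketched as follows. This is a minimal illustration, not AgentComp's actual implementation: the `generate`, `edit`, and `vqa` functions are hypothetical stand-ins for the agent's tools, and the perturbation string is an invented example of a minimal compositional change.

```python
def generate(prompt):
    # Stand-in for the T2I generation tool invoked by the agent.
    return f"image({prompt})"

def edit(image, instruction):
    # Stand-in for the image-editing tool; applies one minimal change
    # so the pair differs only in a single compositional detail.
    return f"edited({image}; {instruction})"

def vqa(image, question):
    # Stand-in for the VQA tool used to verify that the generated
    # image is faithful to the prompt before keeping the pair.
    return "yes"

def build_contrastive_pair(prompt, perturbation):
    """Construct one (preferred, dispreferred) contrastive example:
    the preferred image matches the prompt, while the dispreferred
    one is a targeted edit that breaks a single compositional detail."""
    preferred = generate(prompt)
    dispreferred = edit(preferred, perturbation)
    # Keep the pair only if the VQA tool confirms faithfulness.
    if vqa(preferred, f"Does this image match: {prompt}?") == "yes":
        return preferred, dispreferred
    return None

pair = build_contrastive_pair(
    "a red cube on a blue sphere",          # compositional prompt
    "swap the colors of the two objects",   # hypothetical perturbation
)
```

In the actual system the LLM agent decides which perturbation to apply (e.g. swapping attribute bindings or spatial relations), so that each pair is hard to discriminate and therefore informative for preference tuning.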
Problem

Research questions and friction points this paper is trying to address.

Enhances compositional accuracy in text-to-image generation
Differentiates between compositionally similar prompts and images
Improves object relationships and attribute binding in outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autonomous dataset construction using LLMs with tools
Agentic preference optimization for model fine-tuning
State-of-the-art compositional generation without quality loss
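The preference-tuning step on these contrastive pairs can be illustrated with a standard DPO-style objective, which the paper's agentic preference optimization builds on. This is a generic sketch, not the paper's exact loss; the variable names and `beta` value are assumptions.

```python
import math

def preference_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style loss for one (winner, loser) pair.

    logp_w, logp_l       : policy log-likelihoods of the preferred and
                           dispreferred image-prompt pairs
    ref_logp_w, ref_logp_l: same quantities under a frozen reference model
    beta                 : strength of the implicit KL regularization
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)): small when the policy prefers the winner
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference model, the margin is zero and the loss equals log 2; pushing probability mass toward the preferred compositional sample drives the loss down, which is how the model learns to discriminate compositionally similar outputs.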