DivCon: Divide and Conquer for Progressive Text-to-Image Generation

📅 2024-03-11
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Current diffusion models exhibit weak layout modeling capability and poor controllability for text-to-image generation involving multiple objects and complex spatial relationships; moreover, they rely on proprietary large language models (LLMs) for bounding-box prediction, limiting scalability and accessibility. This paper proposes a two-stage decoupled framework: (1) layout modeling via disentangled numerical/spatial reasoning and bounding-box generation, and (2) progressive object synthesis through iterative conditional generation, starting from simpler objects and moving to more complex ones. The method integrates LLM-based reasoning, layout-guided diffusion models, and joint numerical-spatial modeling. Evaluated on the HRS and NSR-1K benchmarks, the approach significantly outperforms state-of-the-art methods, achieving substantial improvements in multi-object positional accuracy, count consistency, and layout controllability.
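The two-stage framework described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the function names, the toy layout, and the per-object `difficulty` heuristic are all assumptions; a real system would call an LLM for stage 1 and a layout-guided diffusion model for stage 2.

```python
# Hypothetical sketch of the two-stage DivCon-style pipeline.
# All names and values here are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class LayoutBox:
    label: str
    count: int                 # result of numerical reasoning
    box: tuple                 # (x0, y0, x1, y1), normalized coordinates
    difficulty: float          # ordering key for iterative synthesis

def stage1_layout(prompt: str) -> list[LayoutBox]:
    """Stage 1: disentangled reasoning, then bounding-box prediction.

    A real system would query an (open-source) LLM twice: once to
    extract object counts and spatial relations from the prompt, and
    once to turn that plan into boxes. Here we return a fixed toy layout.
    """
    return [
        LayoutBox("cat", 1, (0.1, 0.4, 0.4, 0.9), difficulty=0.2),
        LayoutBox("mirror", 1, (0.5, 0.1, 0.9, 0.9), difficulty=0.8),
    ]

def stage2_iterative_synthesis(layout: list[LayoutBox]) -> list[str]:
    """Stage 2: reconstruct objects from easy to difficult.

    Each iteration would condition a layout-guided diffusion model on
    the objects drawn so far; here we only record the drawing order.
    """
    return [b.label for b in sorted(layout, key=lambda b: b.difficulty)]

layout = stage1_layout("a cat in front of a mirror")
print(stage2_iterative_synthesis(layout))  # easier objects come first
```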

📝 Abstract
Diffusion-driven text-to-image (T2I) generation has achieved remarkable advancements. To further improve T2I models' capability in numerical and spatial reasoning, the layout is employed as an intermediary to bridge large language models and layout-based diffusion models. However, these methods still struggle with generating images from textual prompts with multiple objects and complicated spatial relationships. To tackle this challenge, we introduce a divide-and-conquer approach which decouples the T2I generation task into simple subtasks. Our approach divides the layout prediction stage into numerical & spatial reasoning and bounding box prediction. Then, the layout-to-image generation stage is conducted in an iterative manner to reconstruct objects from easy ones to difficult ones. We conduct experiments on the HRS and NSR-1K benchmarks, and our approach outperforms previous state-of-the-art models by notable margins. In addition, visual results demonstrate that our approach significantly improves the controllability and consistency in generating multiple objects from complex textual prompts.
Problem

Research questions and friction points this paper is trying to address.

Improving numerical and spatial reasoning in text-to-image generation
Overcoming limitations of closed-source large language models
Handling multiple objects with complex spatial relationships
Innovation

Methods, ideas, or system contributions that make the work stand out.

Divide-and-conquer approach for generation tasks
Lightweight LLMs achieve comparable layout accuracy
Two-step synthesis from easy to difficult objects
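The easy-to-difficult synthesis mentioned in the last bullet can be illustrated with a minimal sketch. The split criterion below is an assumption for illustration: objects whose per-object image/text similarity from a first generation pass clears a threshold are treated as "easy" and frozen, while the rest are regenerated. The function name, the scores, and the threshold are all hypothetical.

```python
# Illustrative sketch (not the authors' code): separate well-generated
# ("easy") objects from poorly generated ("hard") ones after a first
# pass, using hypothetical per-object image/text similarity scores.

def split_easy_hard(scores: dict[str, float], threshold: float = 0.5):
    """Return (easy, hard) object lists given similarity scores.

    Objects at or above the threshold are kept from the first pass;
    the rest would be regenerated in a subsequent iteration.
    """
    easy = [name for name, s in scores.items() if s >= threshold]
    hard = [name for name, s in scores.items() if s < threshold]
    return easy, hard

easy, hard = split_easy_hard({"cat": 0.72, "mirror": 0.31})
print(easy, hard)  # ['cat'] ['mirror']
```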
Yuhao Jia
Individual Researcher
Wenhan Tan
Individual Researcher