ComposeAnything: Composite Object Priors for Text-to-Image Generation

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing text-to-image (T2I) models struggle to generate high-fidelity images featuring complex, novel object compositions and precise 3D spatial relationships. To address this, we propose a general-purpose, fine-tuning-free enhancement framework for diffusion-based T2I generation. Our method leverages chain-of-thought reasoning in large language models (LLMs) to produce interpretable, editable 2.5D semantic layouts that encode object categories, 2D bounding boxes, relative depth, and detailed per-object captions, thereby establishing spatially- and depth-aware composite object priors. These priors replace stochastic noise initialization and are integrated via object prior reinforcement and spatially-controlled denoising, enabling structured, geometry-informed generation without retraining the underlying T2I model. The result is an LLM-driven, controllable generation paradigm built on 2.5D layout priors. Extensive experiments on T2I-CompBench and NSR-1K demonstrate significant improvements over state-of-the-art methods on prompts with high object counts, fine-grained 2D/3D spatial relations, and surreal compositions. Human evaluation confirms superior compositional fidelity and image quality.
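To ground the 2.5D semantic layout described above, the following is a minimal sketch of how such a layout could be represented. The schema (field names, depth convention, and the example scene) is an illustrative assumption, not the paper's actual format:

```python
# Hypothetical 2.5D semantic layout as the paper describes it: per-object
# bounding boxes enriched with relative depth and detailed captions.
# Field names and the depth convention are assumptions, not the authors' schema.
from dataclasses import dataclass, field

@dataclass
class ObjectLayout:
    caption: str                              # detailed per-object caption
    bbox: tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), normalized to [0, 1]
    depth: float                              # relative depth; smaller = closer to the camera (assumed)

@dataclass
class SceneLayout:
    prompt: str
    objects: list[ObjectLayout] = field(default_factory=list)

# Example scene an LLM might emit via chain-of-thought reasoning
layout = SceneLayout(
    prompt="a red fox sitting on a floating bookshelf above a river",
    objects=[
        ObjectLayout("a red fox sitting", (0.30, 0.10, 0.70, 0.45), depth=0.4),
        ObjectLayout("a wooden floating bookshelf", (0.20, 0.45, 0.80, 0.65), depth=0.5),
        ObjectLayout("a calm river", (0.00, 0.65, 1.00, 1.00), depth=0.9),
    ],
)
```

Because the layout is plain structured data, it stays interpretable and editable: a user can nudge a bounding box or swap a depth value before any image is generated.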

📝 Abstract
Generating images from text involving complex and novel object arrangements remains a significant challenge for current text-to-image (T2I) models. Although prior layout-based methods improve object arrangements using spatial constraints with 2D layouts, they often struggle to capture 3D positioning and sacrifice quality and coherence. In this work, we introduce ComposeAnything, a novel framework for improving compositional image generation without retraining existing T2I models. Our approach first leverages the chain-of-thought reasoning abilities of LLMs to produce 2.5D semantic layouts from text, consisting of 2D object bounding boxes enriched with depth information and detailed captions. Based on this layout, we generate a spatial- and depth-aware coarse composite of objects that captures the intended composition, serving as a strong and interpretable prior that replaces stochastic noise initialization in diffusion-based T2I models. This prior guides the denoising process through object prior reinforcement and spatial-controlled denoising, enabling seamless generation of compositional objects and coherent backgrounds, while allowing refinement of inaccurate priors. ComposeAnything outperforms state-of-the-art methods on the T2I-CompBench and NSR-1K benchmarks for prompts with 2D/3D spatial arrangements, high object counts, and surreal compositions. Human evaluations further demonstrate that our model generates high-quality images with compositions that faithfully reflect the text.
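The distinctive step is seeding denoising with the coarse composite instead of pure Gaussian noise. Below is a hedged sketch of one way to do this under the standard DDPM forward process q(x_t | x_0); `vae`, `composite_image`, and the function itself are hypothetical stand-ins, and the paper's exact initialization may differ:

```python
# Minimal sketch (assumptions labeled): replace pure-noise initialization
# with a noised encoding of the coarse object composite, using the standard
# DDPM forward process. `vae` and `composite_image` are hypothetical.
import torch

def prior_initialized_latent(vae, composite_image, alphas_cumprod, t_start):
    """Encode the coarse composite and diffuse it to timestep t_start,
    so denoising starts from a structured prior instead of white noise."""
    with torch.no_grad():
        z0 = vae.encode(composite_image)     # hypothetical encoder call
    # alphas_cumprod: 1-D tensor of cumulative products from the noise schedule
    alpha_bar = alphas_cumprod[t_start]
    noise = torch.randn_like(z0)
    # q(x_t | x_0) = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * noise
    return z_t
```

Starting from an intermediate timestep t_start preserves the composite's coarse structure while leaving enough noise for the model to synthesize texture and a coherent background.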
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image generation for complex object arrangements
Addressing 3D positioning and coherence issues in layout-based methods
Enhancing compositional image generation without retraining existing models
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs generate 2.5D semantic layouts via chain-of-thought reasoning
Spatial- and depth-aware coarse composite replaces stochastic noise initialization
Object prior reinforcement and spatially-controlled denoising guide generation (see the sketch after this list)
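The bullets above compress the denoising-time mechanism. Since the abstract does not spell out the exact reinforcement rule, here is a plausible sketch of spatially-controlled denoising as masked latent blending, a common device in layout-guided diffusion; the mask, the linear decay schedule, and the `reinforce_until` cutoff are all assumptions:

```python
# Illustrative sketch of spatially-controlled denoising via masked latent
# blending: inside object regions, pull the current latent toward the
# (re-noised) prior latent during early steps. The blending rule and decay
# schedule are assumptions, not the authors' exact formulation.
import torch

def reinforce_object_prior(z_t, prior_z_t, object_mask, step, reinforce_until):
    """Blend the current latent z_t with the prior latent prior_z_t inside
    object_mask for the first `reinforce_until` denoising steps."""
    if step >= reinforce_until:
        return z_t  # later steps run unconstrained to refine inaccurate priors
    strength = 1.0 - step / reinforce_until  # decay the constraint over time
    blended = strength * prior_z_t + (1.0 - strength) * z_t
    return object_mask * blended + (1.0 - object_mask) * z_t
```

Decaying the constraint over steps matches the abstract's note that inaccurate priors can still be refined: early steps lock in the intended composition, later steps run unconstrained.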