Compositional Image Synthesis with Inference-Time Scaling

📅 2025-10-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current text-to-image models exhibit significant limitations in compositional reasoning, particularly in rendering object counts, attributes, and spatial relationships. To address this, we propose a training-free, inference-time framework: first, a large language model generates an explicit scene layout from the prompt; then the layout guides image generation, and an object-centric vision-language model judge evaluates and iteratively reranks multiple candidate images; finally, a self-refinement loop corrects the generated output over successive rounds. The framework unifies explicit layout guidance with inference-time self-refinement. It improves fine-grained text-image alignment, especially on complex compositional prompts, while preserving aesthetic quality, and it achieves stronger scene alignment than recent text-to-image models.

📝 Abstract

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts and inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge iteratively reranks multiple candidates to select the most prompt-aligned outcome. By unifying explicit layout grounding with self-refinement-based inference-time scaling, our framework achieves stronger scene alignment with prompts than recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
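The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose_layout`, `generate`, and `judge` are hypothetical callables standing in for the LLM layout planner, the layout-conditioned image generator, and the object-centric VLM judge, and feeding the current best image back into the next round is only one plausible reading of "self-refinement".

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Candidate:
    image: Any    # whatever the generator returns (hypothetical placeholder)
    score: float  # judge score, higher means better prompt alignment


def refine_with_reranking(
    prompt: str,
    propose_layout: Callable[[str], Any],                # hypothetical LLM planner
    generate: Callable[[str, Any, Optional[Any]], Any],  # hypothetical generator
    judge: Callable[[str, Any, Any], float],             # hypothetical VLM judge
    n_candidates: int = 4,
    n_rounds: int = 3,
) -> Candidate:
    """Best-of-N sampling repeated over several refinement rounds."""
    if n_candidates < 1 or n_rounds < 1:
        raise ValueError("n_candidates and n_rounds must be at least 1")

    # Step 1: the planner turns the prompt into an explicit layout.
    layout = propose_layout(prompt)
    best: Optional[Candidate] = None

    for _ in range(n_rounds):
        # Step 2: sample several candidates, conditioned on the layout and
        # (assumption) on the best image found so far.
        previous = best.image if best is not None else None
        images = [generate(prompt, layout, previous) for _ in range(n_candidates)]

        # Step 3: the judge scores each candidate; keep the round winner
        # only if it beats the running best.
        scored = [Candidate(img, judge(prompt, layout, img)) for img in images]
        round_best = max(scored, key=lambda c: c.score)
        if best is None or round_best.score > best.score:
            best = round_best

    return best
```

In this sketch, inference-time compute scales with `n_candidates * n_rounds` generator calls, which is the knob that inference-time scaling turns; no model weights are updated at any point.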

Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image models' compositionality for object counts

Enhancing attribute accuracy and spatial relations in generated images

Achieving better layout faithfulness while preserving aesthetic quality

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs to synthesize explicit layouts from prompts

Injects the layouts into the image generation process

Uses an object-centric VLM judge with self-refinement to iteratively rerank candidate images
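To make "object-centric" evaluation concrete, the sketch below checks a set of detected objects against the counts and one spatial relation implied by a prompt. It is an illustrative assumption only: the paper's judge is a VLM, whereas this example uses simple rule-based checks, and `LayoutObject`, `count_accuracy`, and `is_left_of` are hypothetical names introduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Normalized box coordinates: (x_min, y_min, x_max, y_max), each in [0, 1].
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class LayoutObject:
    name: str       # object category, e.g. "cat"
    attribute: str  # e.g. a color; empty string if unspecified
    box: Box


def count_accuracy(expected: Dict[str, int], detected: List[LayoutObject]) -> float:
    """Fraction of object categories whose detected count matches exactly."""
    if not expected:
        return 1.0
    found = Counter(obj.name for obj in detected)
    hits = sum(1 for name, n in expected.items() if found[name] == n)
    return hits / len(expected)


def is_left_of(a: LayoutObject, b: LayoutObject) -> bool:
    """True if the horizontal center of `a` lies left of the center of `b`."""
    def center_x(obj: LayoutObject) -> float:
        return (obj.box[0] + obj.box[2]) / 2

    return center_x(a) < center_x(b)


# Usage: a prompt like "two cats to the left of a blue dog".
detected = [
    LayoutObject("cat", "", (0.05, 0.30, 0.25, 0.70)),
    LayoutObject("cat", "", (0.28, 0.30, 0.45, 0.70)),
    LayoutObject("dog", "blue", (0.60, 0.20, 0.90, 0.75)),
]
counts_ok = count_accuracy({"cat": 2, "dog": 1}, detected)
cats_left = all(is_left_of(c, detected[2]) for c in detected[:2])
```

Scoring per object rather than per image is what lets a judge pinpoint which count, attribute, or relation failed, which in turn gives a refinement step something specific to correct.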
