Enhancing Compositional Text-to-Image Generation with Reliable Random Seeds

📅 2024-11-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Text-to-image diffusion models suffer from layout inconsistency when generating compositional prompts (e.g., “two dogs” or “a penguin to the right of a bowl”). This work identifies, for the first time, that the initial noise seed implicitly encodes compositional priors—such as viewpoint and spatial relationships—and proposes a fully automatic, annotation-free seed mining method. Our approach integrates an automated compositional consistency evaluator with synthetic data distillation to construct a self-supervised fine-tuning framework. Applied to Stable Diffusion and PixArt-α, our lightweight fine-tuning improves numerical compositional accuracy by 29.3% and 19.5%, respectively, and spatial compositional accuracy by 60.7% and 21.1%. The core contribution lies in revealing the compositional semantics embedded in noise seeds and establishing the first seed-driven paradigm for controllable compositional generation.

📝 Abstract
Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as "two dogs" or "a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-α, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-α.
Problem

Research questions and friction points this paper is trying to address.

Improving text-to-image model consistency
Analyzing initial noise impact
Enhancing compositional image generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Initial noise analysis
Reliable seed mining
Generated image fine-tuning
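The mining loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_image` and `count_objects` are hypothetical stand-ins (here toy stubs so the sketch runs) for a diffusion model seeded with fixed initial noise and the paper's automated compositional-consistency evaluator, e.g. an object detector.

```python
# Sketch of annotation-free reliable-seed mining for numerical composition.
# All function names are illustrative assumptions, not the paper's API.

def generate_image(prompt: str, seed: int) -> dict:
    # Stand-in: a real version would sample the diffusion model with
    # initial noise derived deterministically from `seed`.
    return {"prompt": prompt, "seed": seed}

def count_objects(image: dict, category: str) -> int:
    # Stand-in evaluator: a real version would run object detection on
    # the generated image. This toy rule is deterministic per seed so
    # the sketch is runnable.
    return (image["seed"] * 2654435761) % 4

def mine_reliable_seeds(prompt: str, category: str,
                        target_count: int, seeds) -> list:
    """Keep (seed, image) pairs whose generations match the prompt's
    requested object count; these form the curated fine-tuning set,
    with no manual annotation needed."""
    curated = []
    for seed in seeds:
        image = generate_image(prompt, seed)
        if count_objects(image, category) == target_count:
            curated.append((seed, image))
    return curated

dataset = mine_reliable_seeds("two dogs", "dog", target_count=2,
                              seeds=range(100))
print(len(dataset))
```

Fine-tuning the model on the curated images (the second stage of the pipeline) would then proceed with a standard diffusion training loop over `dataset`; spatial prompts swap the count check for a detector-based relation check (e.g. comparing bounding-box centers).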