🤖 AI Summary
Text-to-image diffusion models suffer from layout inconsistency when generating compositional prompts (e.g., “two dogs” or “a penguin to the right of a bowl”). This work identifies, for the first time, that the initial noise seed implicitly encodes compositional priors—such as viewpoint and spatial relationships—and proposes a fully automatic, annotation-free seed mining method. Our approach integrates an automated compositional consistency evaluator with synthetic data distillation to construct a self-supervised fine-tuning framework. Applied to Stable Diffusion and PixArt-α, our lightweight fine-tuning improves numerical compositional accuracy by 29.3% and 19.5%, respectively, and spatial compositional accuracy by 60.7% and 21.1%. The core contribution lies in revealing the compositional semantics embedded in noise seeds and establishing the first seed-driven paradigm for controllable compositional generation.
📝 Abstract
Text-to-image diffusion models have demonstrated remarkable capability in generating realistic images from arbitrary text prompts. However, they often produce inconsistent results for compositional prompts such as"two dogs"or"a penguin on the right of a bowl". Understanding these inconsistencies is crucial for reliable image generation. In this paper, we highlight the significant role of initial noise in these inconsistencies, where certain noise patterns are more reliable for compositional prompts than others. Our analyses reveal that different initial random seeds tend to guide the model to place objects in distinct image areas, potentially adhering to specific patterns of camera angles and image composition associated with the seed. To improve the model's compositional ability, we propose a method for mining these reliable cases, resulting in a curated training set of generated images without requiring any manual annotation. By fine-tuning text-to-image models on these generated images, we significantly enhance their compositional capabilities. For numerical composition, we observe relative increases of 29.3% and 19.5% for Stable Diffusion and PixArt-{alpha}, respectively. Spatial composition sees even larger gains, with 60.7% for Stable Diffusion and 21.1% for PixArt-{alpha}.