🤖 AI Summary
Diffusion models are unreliable in multi-object generation, yet the underlying causes remain unclear. This work introduces Mosaic, a controllable dataset designed to disentangle factors such as scene complexity, concept imbalance, and held-out concept combinations, enabling systematic evaluation of diffusion models’ capabilities in both concept and compositional generalization. Through controlled text-to-image generation, distributional analysis, and generalization assessment, the study finds that scene complexity, rather than concept imbalance, is the primary driver of multi-object generation failures; that counting is especially difficult to learn in low-data regimes; and that compositional generalization deteriorates sharply as more concept combinations are held out of training. The proposed benchmark offers reproducible insights to advance the understanding and improvement of multi-object image generation.
📝 Abstract
Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce Mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on Mosaic, we find that scene complexity, rather than concept imbalance, plays the dominant role, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
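To make the compositional generalization regime concrete, the following is a minimal sketch of a held-out split in which every individual concept still appears in training but some concept combinations do not. The vocabularies, the function name `compositional_split`, and the `holdout_fraction` parameter are all hypothetical and chosen for illustration only; they are not taken from the Mosaic framework, whose actual concepts and splitting procedure are not described here.

```python
import itertools
import random

# Hypothetical concept vocabularies (assumption; not the actual Mosaic concepts).
SHAPES = ["cube", "sphere", "cone", "cylinder"]
COLORS = ["red", "green", "blue", "yellow"]


def compositional_split(shapes, colors, holdout_fraction, seed=0):
    """Hold out a fraction of (color, shape) pairs while keeping every
    individual color and shape present in at least one training pair."""
    pairs = list(itertools.product(colors, shapes))
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_holdout = int(len(pairs) * holdout_fraction)

    train = list(pairs)
    held_out = []
    for pair in pairs:
        if len(held_out) >= n_holdout:
            break
        remaining = [p for p in train if p != pair]
        colors_left = {c for c, _ in remaining}
        shapes_left = {s for _, s in remaining}
        # Only hold out a pair if each single concept is still seen in training.
        if colors_left == set(colors) and shapes_left == set(shapes):
            train = remaining
            held_out.append(pair)
    return train, held_out


train_pairs, held_out_pairs = compositional_split(SHAPES, COLORS, holdout_fraction=0.25)
```

Raising `holdout_fraction` in this sketch corresponds to holding out more concept combinations, the setting in which the abstract reports that compositional generalization collapses.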