🤖 AI Summary
Generating multi-object scenes faces dual challenges: achieving semantic richness while ensuring spatial plausibility. Diffusion models lack explicit spatial reasoning, whereas conventional robotic planning methods struggle to encode visual semantics. This paper introduces the first framework integrating vision-language agents with compositional diffusion models to jointly model semantic relationships and geometric constraints. Methodologically, a vision-language model performs image segmentation, object-scale estimation, scene graph construction, and prompt rewriting; these outputs guide a compositional diffusion model to generate semantically consistent spatial layouts (i.e., bounding boxes), which are then refined by a foreground-conditioned image generator to produce high-fidelity scenes. Experiments demonstrate significant improvements over state-of-the-art methods in layout coherence, physical plausibility, and aesthetic alignment, enabling high-quality synthesis of complex multi-object scenes.
📝 Abstract
Designing realistic multi-object scenes requires not only generating images but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, recent advances in diffusion models have enabled high-quality image generation, yet these models lack explicit spatial reasoning and often produce unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency but struggle to capture the semantic richness of visual scenes. To bridge this gap, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images containing target objects, our method first employs a vision-language model to preprocess the inputs through segmentation, object-size estimation, scene-graph construction, and prompt rewriting. We then leverage compositional diffusion, a technique traditionally used in robotics, to synthesize bounding boxes for the spatial layout that respect the object relations encoded in the scene graph. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout, guided by the rewritten prompts. Experiments demonstrate that LayoutAgent outperforms state-of-the-art layout generation models in layout coherence, spatial realism, and aesthetic alignment.
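To make the three-stage pipeline concrete, here is a minimal runnable sketch of its control flow. All names (`vlm_preprocess`, `sample_layout`, `satisfies`) are illustrative stand-ins, not the paper's actual API: the vision-language model and the compositional diffusion sampler are replaced by trivial heuristics so the skeleton executes end to end, and the final image-rendering stage is only noted in a comment.

```python
"""Hypothetical sketch of a LayoutAgent-style pipeline (not the paper's code)."""
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in normalized [0, 1] layout coordinates."""
    x: float
    y: float
    w: float
    h: float


# Stage 1 stand-in for the vision-language model: segmentation, object-size
# estimation, scene-graph construction, and prompt rewriting are condensed
# into fixed, hand-written outputs here.
def vlm_preprocess(object_names):
    sizes = {name: 0.2 for name in object_names}      # estimated object scales
    relations = [("cup", "left_of", "laptop")]        # scene-graph edges
    prompt = "a tidy desk with " + " and ".join(object_names)
    return sizes, relations, prompt


# Stage 2 stand-in for compositional diffusion: instead of sampling with
# relation-specific score functions, place objects left to right so the
# scene-graph edge above is trivially satisfied.
def sample_layout(sizes, relations):
    layout, x = {}, 0.1
    for name, s in sizes.items():
        layout[name] = Box(x, 0.5, s, s)
        x += s + 0.1
    return layout


def satisfies(layout, relations):
    """Check every scene-graph edge against the sampled bounding boxes."""
    for a, rel, b in relations:
        if rel == "left_of" and layout[a].x + layout[a].w > layout[b].x:
            return False
    return True


if __name__ == "__main__":
    sizes, relations, prompt = vlm_preprocess(["cup", "laptop"])
    layout = sample_layout(sizes, relations)
    assert satisfies(layout, relations)
    # Stage 3 would render the objects into these boxes with a
    # foreground-conditioned image generator, guided by `prompt`.
    print(prompt, layout)
```

The design point this sketch mirrors is the separation of concerns in the abstract: semantic structure comes from the scene graph, geometric feasibility from the layout sampler, and only verified box assignments are handed to the renderer.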