LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

📅 2025-09-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating multi-object scenes faces dual challenges: achieving semantic richness while ensuring spatial plausibility—diffusion models lack explicit spatial reasoning, whereas conventional robotic planning methods struggle to encode visual semantics. This paper introduces the first framework integrating vision-language agents with compositional diffusion models to jointly model semantic relationships and geometric constraints. Methodologically, a vision-language model performs image segmentation, object-scale estimation, scene graph construction, and prompt rewriting; these outputs guide a compositional diffusion model to generate semantically consistent spatial layouts (i.e., bounding boxes), which are then refined by a foreground-conditioned image generator to produce high-fidelity scenes. Experiments demonstrate significant improvements over state-of-the-art methods in layout coherence, physical plausibility, and aesthetic alignment, enabling high-quality synthesis of complex multi-object scenes.
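The three-stage pipeline described in the summary (vision-language preprocessing, compositional diffusion over bounding boxes, foreground-conditioned rendering) could be orchestrated roughly as sketched below. All function names, the `SceneGraph` type, and the stubbed stage logic are hypothetical placeholders for illustration, not the paper's actual code:

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    objects: list    # object names with (implied) estimated scales
    relations: list  # (subject, predicate, object) triples

def vlm_preprocess(images, prompt):
    """Stage 1: segment objects, estimate sizes, build a scene graph,
    and rewrite the prompt (all stubbed here)."""
    objects = [f"obj_{i}" for i, _ in enumerate(images)]
    relations = [(objects[i], "near", objects[i + 1])
                 for i in range(len(objects) - 1)]
    return SceneGraph(objects, relations), prompt + " (refined)"

def diffuse_layout(graph):
    """Stage 2: compositional diffusion over bounding boxes respecting
    the scene-graph relations (stubbed: evenly spaced boxes)."""
    n = len(graph.objects)
    return {o: (i / n, 0.5, 0.2, 0.2) for i, o in enumerate(graph.objects)}

def render_scene(layout, prompt):
    """Stage 3: foreground-conditioned generation into the planned layout."""
    return {"layout": layout, "prompt": prompt}

def layout_agent(images, prompt):
    graph, rewritten = vlm_preprocess(images, prompt)
    return render_scene(diffuse_layout(graph), rewritten)
```

The point of the sketch is the data flow: the scene graph produced in stage 1 is the only interface the diffusion stage sees, and the rendered output conditions only on the planned boxes and the rewritten prompt.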

📝 Abstract
Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled high-quality image generation, they lack explicit spatial reasoning, leading to unrealistic object layouts. On the other hand, traditional spatial planning methods in robotics emphasize geometric and relational consistency, but struggle to capture the semantic richness of visual scenes. To bridge this gap, we propose LayoutAgent, an agentic framework that unifies vision-language reasoning with compositional diffusion for layout generation. Given multiple input images containing target objects, our method first employs a vision-language model to preprocess the inputs through segmentation, object size estimation, scene graph construction, and prompt rewriting. We then leverage compositional diffusion, a method traditionally used in robotics, to synthesize bounding boxes for spatial layouts that respect the object relations encoded in the scene graph. Finally, a foreground-conditioned image generator composes the complete scene by rendering the objects into the planned layout, guided by the designed prompts. Experiments demonstrate that LayoutAgent outperforms other state-of-the-art layout generation models in layout coherence, spatial realism, and aesthetic alignment.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic multi-object scenes with spatial layouts
Bridging diffusion models' image quality with spatial reasoning
Overcoming traditional planning methods' limited capture of semantic richness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision-language agent integrates compositional diffusion for layouts
Scene graph guides bounding box synthesis via diffusion
Foreground-conditioned generator renders objects into planned layouts
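To make the second innovation concrete, here is a minimal sketch of what "compositional" sampling over boxes can look like: each scene-graph relation contributes its own update rule, the rules are summed, and boxes (cx, cy, w, h in normalized coordinates) are refined with annealed Langevin-style noise. The energies, function names, and hyperparameters are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def non_overlap_grad(boxes, i, j):
    """Push boxes i and j apart along their center offset if they intersect."""
    d = boxes[i, :2] - boxes[j, :2]
    min_gap = (boxes[i, 2:] + boxes[j, 2:]) / 2   # half-extents must not overlap
    overlap = min_gap - np.abs(d)
    g = np.zeros_like(boxes)
    if (overlap > 0).all():                        # intersecting on both axes
        push = np.sign(d) * overlap
        g[i, :2] += push / 2
        g[j, :2] -= push / 2
    return g

def left_of_grad(boxes, i, j, margin=0.02):
    """Encourage box i's center to lie left of box j's center."""
    g = np.zeros_like(boxes)
    gap = boxes[j, 0] - boxes[i, 0] - margin
    if gap < 0:                                    # relation violated: split the fix
        g[i, 0] += gap / 2
        g[j, 0] -= gap / 2
    return g

def compose_and_sample(boxes, relations, steps=200, lr=0.1, noise=0.01):
    """Sum the per-relation updates, add decaying noise to the centers,
    and step toward a layout satisfying all relations at once."""
    rng = np.random.default_rng(0)
    for t in range(steps):
        grad = np.zeros_like(boxes)
        for rel, i, j in relations:
            grad += rel(boxes, i, j)
        boxes = boxes + lr * grad
        sigma = noise * (1 - t / steps)            # annealing schedule
        boxes[:, :2] += sigma * rng.normal(size=(len(boxes), 2))
        boxes[:, :2] = np.clip(boxes[:, :2], 0, 1)
    return boxes
```

Because each relation is an independent term, new constraints from the scene graph compose by simple addition; this is the property that lets a graph with arbitrary relation sets drive one shared sampler.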
Zezhong Fan
Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Xiaohan Li
Walmart Inc.
Data Mining, Recommender Systems, Medical AI
Luyi Ma
Walmart
Recommender Systems, Representation Learning, Seasonality, User Modeling
Kai Zhao
Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Liang Peng
Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Topojoy Biswas
Personalization Team, Walmart Global Tech, Sunnyvale, California, USA
Evren Korpeoglu
Walmart Global Tech
Machine Learning, Recommender Systems
Kaushiki Nag
University of Minnesota, Twin Cities
Recommender Systems, Machine Learning
Kannan Achan
Walmartlabs
Machine Learning, Artificial Intelligence, Generative Modeling