🤖 AI Summary
Existing text-to-layout generation methods are constrained by closed vocabularies or by reliance on proprietary large language models (LLMs), limiting their controllability and broader applicability. This paper introduces Lay-Your-Scene (LayouSyn): a lightweight, open-vocabulary pipeline for text-to-layout generation in natural scenes. Methodologically, it uses a lightweight open-source language model to parse scene elements from text prompts, paired with a novel aspect-aware diffusion Transformer, trained in an open-vocabulary manner, for conditional layout generation. LayouSyn achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. It further supports two applications: refining coarse layout initializations from LLMs for better results, and an object-insertion pipeline that demonstrates its potential for image editing.
📝 Abstract
We present Lay-Your-Scene (shorthand LayouSyn), a novel text-to-layout generation pipeline for natural scenes. Prior scene layout generation methods are either closed-vocabulary or use proprietary large language models for open-vocabulary generation, limiting their modeling capabilities and broader applicability in controllable image generation. In this work, we propose to use lightweight open-source language models to obtain scene elements from text prompts and a novel aspect-aware diffusion Transformer architecture trained in an open-vocabulary manner for conditional layout generation. Extensive experiments demonstrate that LayouSyn outperforms existing methods and achieves state-of-the-art performance on challenging spatial and numerical reasoning benchmarks. Additionally, we present two applications of LayouSyn. First, we show that coarse initialization from large language models can be seamlessly combined with our method to achieve better results. Second, we present a pipeline for adding objects to images, demonstrating the potential of LayouSyn in image editing applications.
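To make the pipeline's output concrete, a minimal sketch of one plausible layout representation is shown below: open-vocabulary text labels paired with normalized bounding boxes, resolved against a canvas of arbitrary aspect ratio. The class and function names are illustrative assumptions, not the paper's actual interface or conditioning mechanism.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LayoutElement:
    # Open-vocabulary label (free-form text) plus a bounding box
    # in normalized [0, 1] coordinates (left, top, width, height).
    label: str
    x: float
    y: float
    w: float
    h: float

def to_pixels(elements: List[LayoutElement],
              width: int, height: int) -> List[Tuple[str, int, int, int, int]]:
    """Resolve normalized boxes onto a canvas of the desired size.

    Keeping boxes normalized and mapping them to an arbitrary canvas is one
    way an aspect-aware generator's output could be consumed downstream
    (e.g., by a layout-to-image model); this is a sketch, not the paper's code.
    """
    return [
        (e.label,
         round(e.x * width), round(e.y * height),
         round(e.w * width), round(e.h * height))
        for e in elements
    ]

# Hypothetical layout for the prompt "a red bicycle next to a tree".
layout = [
    LayoutElement("a red bicycle", 0.1, 0.5, 0.3, 0.4),
    LayoutElement("a tree", 0.6, 0.1, 0.3, 0.8),
]
print(to_pixels(layout, 1024, 768))
```

Because the boxes stay normalized until rendering, the same generated layout can target canvases of different aspect ratios without re-running the model.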