SLayR: Scene Layout Generation with Rectified Flow

📅 2024-12-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing text-to-image generation methods struggle to balance layout plausibility and diversity under ambiguous prompts. This paper proposes SLayR, the first Rectified Flow-based Transformer model that directly generates scene layouts in a token space, producing decodable bounding boxes and semantic labels and enabling end-to-end co-generation with layout-to-image models. Key contributions include: (i) the first adaptation of Rectified Flow to layout generation; (ii) the first comprehensive, reproducible benchmark for layout quality evaluation, incorporating human assessment; and (iii) state-of-the-art performance in both layout plausibility and diversity. Compared to baselines, SLayR uses at least 5× fewer parameters, runs inference 37% faster, and yields interpretable, editable outputs, significantly enhancing fine-grained controllability in text-to-image synthesis.

📝 Abstract
We introduce SLayR, Scene Layout Generation with Rectified Flow. State-of-the-art text-to-image models achieve impressive results. However, they generate images end-to-end, exposing no fine-grained control over the process. SLayR presents a novel transformer-based rectified flow model for layout generation over a token space that can be decoded into bounding boxes and corresponding labels, which can then be transformed into images using existing models. We show that established metrics for generated images are inconclusive for evaluating their underlying scene layout, and introduce a new benchmark suite, including a carefully designed repeatable human-evaluation procedure that assesses the plausibility and variety of generated layouts. In contrast to previous works, which achieve either high variety or high plausibility, we show that our approach performs well on both of these axes at the same time. It is also at least 5× smaller in parameter count and 37% faster than the baselines. Our complete text-to-image pipeline demonstrates the added benefits of an interpretable and editable intermediate representation.
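The abstract describes sampling layouts by integrating a learned rectified flow over a token space, then decoding each token into a bounding box and a class label. The sketch below illustrates that pipeline shape only; the velocity model, token layout, and decoding scheme (names like `velocity_model`, `decode_layout`, `NUM_TOKENS`, `TOKEN_DIM`) are illustrative assumptions, not the authors' implementation, and the toy velocity field stands in for the trained, text-conditioned transformer.

```python
import numpy as np

NUM_TOKENS = 8      # max objects per scene (assumed)
TOKEN_DIM = 4 + 5   # 4 box coordinates + 5 class logits (toy setting)

rng = np.random.default_rng(0)

def velocity_model(x, t):
    """Stand-in for the learned transformer velocity field v(x, t).
    A trained SLayR-style model would condition on the text prompt;
    here we use placeholder dynamics so the sketch runs end to end."""
    return -x  # not a trained network

def sample_layout_tokens(steps=50):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1,
    the standard way to sample from a rectified flow."""
    x = rng.standard_normal((NUM_TOKENS, TOKEN_DIM))
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_model(x, i * dt)
    return x

def decode_layout(tokens):
    """Decode continuous tokens into boxes (cx, cy, w, h in [0, 1])
    and discrete class labels via argmax over the logit slice."""
    boxes = 1.0 / (1.0 + np.exp(-tokens[:, :4]))  # sigmoid -> [0, 1]
    labels = tokens[:, 4:].argmax(axis=1)
    return boxes, labels

tokens = sample_layout_tokens()
boxes, labels = decode_layout(tokens)
print(boxes.shape, labels.shape)  # (8, 4) (8,)
```

The decoded boxes and labels would then be handed to an off-the-shelf layout-to-image model, which is what makes the intermediate representation inspectable and editable before pixels are generated.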
Problem

Research questions and friction points this paper is trying to address.

How can a model generate layouts from ambiguous text prompts that are both diverse and plausible?
End-to-end text-to-image pipelines expose no fine-grained control over scene layout.
Established image metrics are inconclusive for the underlying layout, so a dedicated benchmark is needed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based rectified flow model for text-to-layout generation
New benchmark suite, including a repeatable human evaluation of layout plausibility and variety
At least 5× fewer parameters and 37% faster inference than baselines, with strong layout quality