Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking

📅 2025-09-15

📈 Citations: 0

✨ Influential: 0

career value

196K/year

🤖 AI Summary

To address layout-description misalignment in autoregressive (AR) text-to-image generation—caused by sparse layout conditioning, weak spatial constraint modeling, and feature entanglement—this paper proposes a layout-aware AR generation framework. Methodologically, it introduces (1) a structured attention masking mechanism that explicitly encodes inter-region spatial relationships to suppress cross-region semantic confusion, and (2) a post-training framework based on Groupwise Relative Policy Optimization (GRPO), jointly optimizing layout alignment and image quality rewards for fine-grained control. Without compromising generation fidelity, the method significantly improves layout accuracy (+12.7% BoxIoU) and structural consistency, achieving state-of-the-art performance on both COCO-Stuff and Layout-Guided benchmarks.

Technology Category

Application Category

📝 Abstract

While autoregressive (AR) models have demonstrated remarkable success in image generation, extending them to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework for layoutto-image generation that effectively integrates spatial layout constraints into AR-based image generation. To equip AR model with layout control, a specially designed structured masking strategy is applied to attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents mis-association between different regions and their descriptions while enabling sufficient injection of layout constraints into the generation process. To further enhance generation quality and layout accuracy, we incorporate Group Relative Policy Optimization (GRPO) based post-training scheme with specially designed layout reward functions for next-set-based AR models. Experimental results demonstrate that SMARLI is able to seamlessly integrate layout tokens with text and image tokens without compromising generation quality. It achieves superior layoutaware control while maintaining the structural simplicity and generation efficiency of AR models.

Problem

Research questions and friction points this paper is trying to address.

Integrating layout constraints into autoregressive image generation models

Preventing feature entanglement between layout and image tokens

Maintaining generation quality while enabling layout-aware control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured masking strategy for attention computation

Group Relative Policy Optimization post-training scheme

Integration of layout tokens with text tokens

🔎 Similar Papers

Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining