PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the disconnection between layout planning and pixel-level synthesis in image generation. The authors propose PlanGen, the first framework to unify layout planning and image generation within a single autoregressive vision-language Transformer. Its core contributions are: (1) a text-layout-image joint sequence modeling paradigm that enables implicit pre-planning of layout conditions via pure next-token prediction; (2) elimination of explicit bounding-box encoding and local-description embeddings, supporting multi-task joint training—planning, generation, understanding, and editing—under unified prompting; and (3) novel mechanisms for in-context layout conditioning, teacher-forced content manipulation, and negative layout guidance. Experiments show that PlanGen significantly outperforms diffusion-based baselines across layout planning, layout-to-image generation, layout understanding, and editing tasks, validating the effectiveness and generalizability of autoregressive unified modeling.
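The joint sequence modeling idea can be sketched as follows. This is a hypothetical illustration, not the paper's actual vocabulary or prompt format: the tag names (`<prompt>`, `<region>`, `<image>`) and the region serialization are assumptions. The key point is that layout conditions are serialized as ordinary text tokens in context, so no bounding-box embedding layers or embed-and-pool operations are needed.

```python
# Hypothetical sketch of text-layout-image joint sequence modeling:
# prompt, layout, and image tokens are concatenated into one stream
# that a single autoregressive model consumes with next-token prediction.
# Tag names and coordinate format are illustrative assumptions.

def build_joint_sequence(prompt, regions, image_tokens=None):
    """Serialize prompt -> layout -> image into one token stream.

    regions: list of (local_caption, (x1, y1, x2, y2)) pairs.
    image_tokens: discrete image token ids (e.g. from a VQ tokenizer);
        None at inference time, where the model generates them.
    """
    parts = [f"<prompt>{prompt}</prompt>"]
    # Layout is expressed as plain text in context: no specialized
    # bounding-box encoder, no pooling over local caption embeddings.
    for caption, (x1, y1, x2, y2) in regions:
        parts.append(f"<region>{caption}|{x1},{y1},{x2},{y2}</region>")
    parts.append("<image>")
    if image_tokens is not None:
        parts.append(" ".join(str(t) for t in image_tokens))
    return " ".join(parts)


seq = build_joint_sequence(
    "a cat on a sofa",
    [("cat", (120, 200, 380, 460)), ("sofa", (0, 300, 512, 512))],
)
```

Because the layout is just more context tokens, the same model can either generate it (layout planning) or consume a user-provided one (layout-to-image) without architectural changes.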

📝 Abstract
In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and layout-to-image generation as two separate models, PlanGen jointly models the two tasks in one autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context without requiring specialized encoding of local captions and bounding-box coordinates, which provides significant advantages over previous embed-and-pool operations on layout conditions, particularly when dealing with complex layouts. Unified prompting allows PlanGen to perform multitask training on layout-related tasks, including layout planning, layout-to-image generation, image layout understanding, etc. In addition, PlanGen can be seamlessly extended to layout-guided image manipulation thanks to its well-designed modeling, with a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.
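The negative layout guidance mentioned in the abstract can be sketched with a classifier-free-guidance-style combination of logits at each decoding step: logits conditioned on the desired layout are pushed away from logits conditioned on an unwanted (negative) layout. The combination rule and guidance scale below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hedged sketch of negative layout guidance: at each autoregressive
# decoding step, steer the next-token distribution toward the positive
# layout condition and away from the negative one, analogous to
# classifier-free guidance. Scale value is an illustrative assumption.

def guided_logits(pos_logits, neg_logits, scale=1.5):
    """Combine layout-conditioned logits for one decoding step.

    pos_logits: logits conditioned on the desired layout in context.
    neg_logits: logits conditioned on the negative (unwanted) layout.
    scale > 1 amplifies the difference between the two conditions.
    """
    pos = np.asarray(pos_logits, dtype=float)
    neg = np.asarray(neg_logits, dtype=float)
    return neg + scale * (pos - neg)
```

With `scale=1.0` this reduces to the positive condition alone; larger scales suppress tokens the negative layout would favor, which is useful when editing a region so that residual content from the old layout does not leak through.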
Problem

Research questions and friction points this paper is trying to address.

How to unify layout planning and image generation within a single autoregressive model.
How to integrate layout conditions as context without specialized bounding-box or caption encoding.
How to train one model across layout-related tasks, including planning, generation, understanding, and image manipulation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified layout planning and image generation model
Autoregressive transformer with next-token prediction
Seamless layout-guided image manipulation capability