🤖 AI Summary
This paper addresses the disconnect between layout planning and pixel-level synthesis in image generation. It proposes PlanGen, the first framework to unify layout planning and image generation within a single autoregressive vision-language Transformer. Its core contributions are: (1) a joint text-layout-image sequence modeling paradigm, in which layout conditions are pre-planned through pure next-token prediction; (2) elimination of specialized encoding for bounding boxes and local descriptions, which supports joint multi-task training (planning, generation, understanding, and editing) under unified prompting; and (3) novel mechanisms for in-context layout conditioning, teacher-forced content manipulation, and negative layout guidance. Experiments show that PlanGen significantly outperforms diffusion-based baselines across layout planning, layout-to-image generation, layout understanding, and editing tasks, supporting the effectiveness and generality of unified autoregressive modeling.
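To make the idea of a joint text-layout sequence concrete, the following is a minimal sketch of how a global caption and a set of boxes could be flattened into one plain-text string that an ordinary tokenizer can consume. The tag names (`<caption>`, `<obj>`, `<box>`), the `Box` class, the `serialize_layout` helper, and the coordinate quantization are all hypothetical choices made for illustration; the summary does not specify the format PlanGen actually uses.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One layout element: a local caption plus normalized corner coordinates."""
    caption: str
    x0: float
    y0: float
    x1: float
    y1: float


def quantize(value: float, bins: int = 1000) -> int:
    """Map a coordinate in [0, 1] to an integer bin in [0, bins - 1]."""
    clipped = min(max(value, 0.0), 1.0)
    return min(int(clipped * bins), bins - 1)


def serialize_layout(global_caption: str, boxes: list[Box], bins: int = 1000) -> str:
    """Flatten caption and boxes into one string (hypothetical tag format)."""
    parts = [f"<caption>{global_caption}</caption>"]
    for box in boxes:
        coords = ",".join(
            str(quantize(v, bins)) for v in (box.x0, box.y0, box.x1, box.y1)
        )
        parts.append(f"<obj>{box.caption}</obj><box>{coords}</box>")
    return "".join(parts)


if __name__ == "__main__":
    layout = [Box("cat", 0.25, 0.5, 0.75, 1.0)]
    print(serialize_layout("a cat on a sofa", layout, bins=100))
```

Because the layout is ordinary text in this sketch, the same string could serve as a conditioning prefix (layout-to-image) or as a prediction target (layout planning) without a dedicated box encoder, which is the property the summary emphasizes.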
📝 Abstract
In this paper, we propose PlanGen, a unified layout planning and image generation model that pre-plans spatial layout conditions before generating images. Unlike previous diffusion-based approaches, which handle layout planning and layout-to-image generation with two separate models, PlanGen models both tasks jointly in a single autoregressive transformer using only next-token prediction. PlanGen integrates layout conditions into the model as context, without requiring specialized encoding of local captions and bounding box coordinates. This provides significant advantages over the embed-and-pool operations previously applied to layout conditions, particularly for complex layouts. Unified prompting allows PlanGen to be trained on multiple layout-related tasks, including layout planning, layout-to-image generation, and image layout understanding. In addition, thanks to this modeling design, PlanGen extends seamlessly to layout-guided image manipulation through a teacher-forcing content manipulation policy and negative layout guidance. Extensive experiments verify the effectiveness of PlanGen on multiple layout-related tasks, showing its great potential. Code is available at: https://360cvgroup.github.io/PlanGen.
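The abstract does not give a formula for negative layout guidance. One plausible reading, assumed here purely for illustration, is a classifier-free-guidance style rule applied to next-token logits: the model is run once with the desired layout in context and once with a negative layout, and the final logits are pushed away from the negative branch. The function names and the `scale` parameter below are hypothetical and are not taken from the PlanGen code.

```python
import math


def guided_logits(cond: list[float], neg: list[float], scale: float) -> list[float]:
    """Extrapolate from negative-layout logits toward conditional logits.

    Assumed rule: neg + scale * (cond - neg). scale = 1 returns the
    conditional logits unchanged; larger values push further from neg.
    """
    if len(cond) != len(neg):
        raise ValueError("logit vectors must have the same length")
    return [n + scale * (c - n) for c, n in zip(cond, neg)]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    cond = [2.0, 0.0, -1.0]  # logits with the target layout in context
    neg = [1.0, 0.0, 0.5]    # logits with the negative layout in context
    print(softmax(guided_logits(cond, neg, scale=3.0)))
```

In this sketch, tokens that the negative-layout branch favors relative to the conditional branch lose probability mass as `scale` grows, which is one way the old content of an edited region could be suppressed during manipulation.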