🤖 AI Summary
In autoregressive image generation, the fixed raster-scan ordering violates semantic causality inherent in image content (e.g., cloud color depends on sun position and color temperature), leading to logically inconsistent generations. To address this, we propose a **content-driven, semantic-aware generation order learning framework**, the first to model the pixel generation sequence as a learnable latent variable. Our method jointly optimizes arbitrary-order autoregressive modeling and order distillation, enabling the sampling sequence to automatically reflect the image’s intrinsic causal structure. It requires no manual annotations or external supervision, supporting unsupervised order inference and conditional patch generation. Evaluated on two mainstream benchmarks, our approach significantly outperforms the raster-scan baseline—reducing FID by 12.3% and improving LPIPS by 8.7%—while maintaining comparable training overhead. This demonstrates both the effectiveness and feasibility of semantically ordered generation for enhancing image quality.
📝 Abstract
Autoregressive patch-based image generation has recently shown competitive results in terms of image quality and scalability. It can also be easily integrated and scaled within Vision-Language models. Nevertheless, autoregressive models require a defined order for patch generation. While a natural order based on the dictation of the words makes sense for text generation, there is no inherent generation order that exists for image generation. Traditionally, a raster-scan order (from top-left to bottom-right) guides autoregressive image generation models. In this paper, we argue that this order is suboptimal, as it fails to respect the causality of the image content: for instance, when conditioned on a visual description of a sunset, an autoregressive model may generate clouds before the sun, even though the color of clouds should depend on the color of the sun and not the inverse. In this work, we show that first by training a model to generate patches in any-given-order, we can infer both the content and the location (order) of each patch during generation. Secondly, we use these extracted orders to finetune the any-given-order model to produce better-quality images. Through our experiments, we show on two datasets that this new generation method produces better images than the traditional raster-scan approach, with similar training costs and no extra annotations.