🤖 AI Summary
This work proposes NextFlow, a decoder-only unified autoregressive Transformer that addresses the limitations of existing autoregressive multimodal models in image generation speed and cross-modal alignment. Trained on 6 trillion interleaved discrete text-image tokens, NextFlow achieves native multimodal understanding and generation. Its key innovations include replacing raster scanning with a “next-scale prediction” strategy to dramatically accelerate high-resolution image synthesis, alongside a multi-scale training stabilization approach and a prefix-tuning-based reinforcement learning mechanism. The model generates 1024×1024 images in under five seconds—significantly faster than comparable autoregressive models—while attaining state-of-the-art visual quality among unified architectures and rivaling specialized diffusion models.
📝 Abstract
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a single autoregressive architecture, NextFlow natively supports multimodal understanding and generation, unlocking capabilities such as image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the two modalities (text is strictly sequential, while images are inherently hierarchical), we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024×1024 images in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
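To see why next-scale prediction is faster than raster scanning, consider the decoding loop it implies: each sequential step emits an entire token map at the next resolution, conditioned on all coarser maps, instead of one token at a time. The sketch below is purely illustrative and not the paper's implementation; `predict_scale`, the scale schedule, and the vocabulary size are assumptions.

```python
import random

def next_scale_decode(predict_scale, scales=(1, 2, 4, 8)):
    """Coarse-to-fine decoding: each sequential step predicts ALL
    tokens of the next (larger) token map at once, conditioned on
    every map generated so far. Raster-scan AR would instead need
    one step per token."""
    history = []                              # token maps so far
    for s in scales:
        token_map = predict_scale(history, s)  # s x s grid of token ids
        history.append(token_map)
    return history

# Toy stand-in for the transformer: uniform random token ids.
# (A real model would condition on the flattened history.)
rng = random.Random(0)
def toy_model(history, s, vocab=4096):
    return [[rng.randrange(vocab) for _ in range(s)] for _ in range(s)]

maps = next_scale_decode(toy_model)
total_tokens = sum(len(m) * len(m[0]) for m in maps)  # 1+4+16+64 = 85
```

With this schedule, 85 image tokens are produced in only 4 sequential forward passes, whereas raster-scan decoding of the same grid would take 85; at 1024×1024 resolutions this gap is what makes the claimed order-of-magnitude speedup plausible.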