🤖 AI Summary
This work proposes NextFlow, a decoder-only unified autoregressive Transformer that addresses the limitations of existing autoregressive multimodal models in image generation speed and cross-modal alignment. Trained on 6 trillion interleaved discrete text-image tokens, NextFlow achieves native multimodal understanding and generation. Its key innovations include replacing raster scanning with a “next-scale prediction” strategy to dramatically accelerate high-resolution image synthesis, alongside a multi-scale training stabilization approach and a prefix-tuning-based reinforcement learning mechanism. The model generates 1024×1024 images in under five seconds—significantly faster than comparable autoregressive models—while attaining state-of-the-art visual quality among unified architectures and rivaling specialized diffusion models.
📝 Abstract
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a single autoregressive architecture, NextFlow natively supports multimodal understanding and generation, unlocking capabilities such as image editing, interleaved content generation, and video generation. Motivated by the distinct nature of the two modalities (text is strictly sequential, while images are inherently hierarchical), we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024×1024 images in just 5 seconds, orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
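To see why next-scale prediction is faster than raster scanning, consider the decoding loop it implies: each sequential step emits an entire token map at the next resolution, conditioned on all coarser maps, instead of one token at a time. The sketch below is purely illustrative and not the paper's implementation; `predict_scale`, the scale schedule, and the vocabulary size are assumptions.

```python
import random

def next_scale_decode(predict_scale, scales=(1, 2, 4, 8)):
    """Coarse-to-fine decoding: each sequential step predicts ALL
    tokens of the next (larger) token map at once, conditioned on
    every map generated so far. Raster-scan AR would instead need
    one step per token."""
    history = []                              # token maps so far
    for s in scales:
        token_map = predict_scale(history, s)  # s x s grid of token ids
        history.append(token_map)
    return history

# Toy stand-in for the transformer: uniform random token ids.
# (A real model would condition on the flattened history.)
rng = random.Random(0)
def toy_model(history, s, vocab=4096):
    return [[rng.randrange(vocab) for _ in range(s)] for _ in range(s)]

maps = next_scale_decode(toy_model)
total_tokens = sum(len(m) * len(m[0]) for m in maps)  # 1+4+16+64 = 85
```

With this schedule, 85 image tokens are produced in only 4 sequential forward passes, whereas raster-scan decoding of the same grid would take 85; at 1024×1024 resolutions this gap is what makes the claimed order-of-magnitude speedup plausible.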