NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

📅 2026-01-05
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
This work proposes NextFlow, a unified decoder-only autoregressive Transformer that targets two weaknesses of existing autoregressive multimodal models: slow image generation and weak cross-modal alignment. Trained on 6 trillion interleaved discrete text-image tokens, NextFlow supports multimodal understanding and generation natively. It keeps next-token prediction for text but replaces raster-scan image decoding with a "next-scale prediction" strategy, which accelerates high-resolution image synthesis. The paper also contributes a training recipe that stabilizes multi-scale generation and a prefix-tuning strategy for reinforcement learning. The model generates 1024×1024 images in about 5 seconds, which the authors describe as orders of magnitude faster than comparable autoregressive models, while achieving state-of-the-art performance among unified models and rivaling specialized diffusion baselines in visual quality.

📝 Abstract
We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.
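The speed argument in the abstract rests on how many sequential decoding steps each strategy needs. The following is a minimal sketch of that arithmetic, not the authors' implementation: the patch size `PATCH` and the scale schedule `SCALES` are hypothetical values chosen for illustration, since the abstract states neither the tokenizer's downsampling factor nor the schedule NextFlow uses. It assumes raster-scan decoding emits one token per step, while next-scale prediction emits a whole token map per step.

```python
def raster_scan_steps(height: int, width: int, patch: int) -> int:
    """Sequential steps when one image token is decoded per step in raster order."""
    return (height // patch) * (width // patch)


def next_scale_steps(scales: list[int]) -> int:
    """Sequential steps when every token of a scale is predicted in one step."""
    return len(scales)


def next_scale_tokens(scales: list[int]) -> int:
    """Total tokens produced across all scales (each scale is an s x s token map)."""
    return sum(s * s for s in scales)


# Hypothetical settings, not taken from the paper.
PATCH = 16                          # assumed tokenizer downsampling factor
SCALES = [1, 2, 4, 8, 16, 32, 64]   # assumed coarse-to-fine schedule; 64 = 1024 / 16

raster = raster_scan_steps(1024, 1024, PATCH)
coarse_to_fine = next_scale_steps(SCALES)

print(f"raster-scan sequential steps: {raster}")          # 4096
print(f"next-scale sequential steps:  {coarse_to_fine}")  # 7
print(f"next-scale total tokens:      {next_scale_tokens(SCALES)}")  # 5461
```

Under these assumptions, next-scale prediction produces more tokens in total but needs far fewer sequential steps, because the tokens within a scale are predicted in parallel. The step counts are illustrative only and say nothing about the wall-clock figures reported in the paper.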
Problem

Research questions and friction points this paper is trying to address.

multimodal understanding
autoregressive modeling
image generation
text-image alignment
unified architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified autoregressive modeling
next-scale prediction
multimodal generation
discrete token representation
prefix-tuning for RL
Authors

Huichao Zhang (Shanghai Jiaotong University): 3D, computer vision, VLM
Liao Qu (Carnegie Mellon University, ByteDance): Computer Vision, MLLM
Yiheng Liu (ByteDance)
Hang Chen (ByteDance)
Yangyang Song (ByteDance)
Yongsheng Dong (ByteDance)
Shikun Sun (Tsinghua University, Cornell University): Machine Learning, Generative Model
Xian Li (ByteDance)
Xu Wang (ByteDance)
Yi Jiang (ByteDance): Generative Models, Large Language Model, Computer Vision
Hu Ye (ByteDance, Tencent): AIGC
Bo Chen (ByteDance): Computer Vision
Yiming Gao (ByteDance)
Peng Liu (ByteDance): MLLM, Domain adaptation
Akide Liu (PhD Student @ Monash University): Efficient AI, Computer Vision
Zhipeng Yang (ByteDance)
Qili Deng (ByteDance)
Linjie Xing (ByteDance)
Jiyang Liu (ByteDance)
Zhao Wang (ByteDance)
Yang Zhou (ByteDance)
Mingcong Liu (ByteDance Inc.): Computer Vision, Deep Learning, Image Enhancement, Infrared Image Processing
Yi Zhang (ByteDance)
Qian He (ByteDance)
Xiwei Hu (ByteDance)
Zhongqi Qi (ByteDance)
Jie Shao (Professor, University of Electronic Science and Technology of China): Multimedia, Database
Zhiye Fu (ByteDance)
Shuai Wang (ByteDance)
Fan-Fan Chen (ByteDance)
Xuezhi Chai (ByteDance)
Zhihua Wu (ByteDance)
Yitong Wang (ByteDance Inc.): computer vision
Zehuan Yuan (ByteDance Inc.): Computer Vision, Multimedia, Machine Learning
Daniel K. Du (ByteDance Intelligent Creation): Artificial Intelligence, Computer Vision, Combinatorics
Xinglong Wu (Algorithm Engineer, ByteDance): Artificial Intelligence