🤖 AI Summary
Existing autoregressive text-to-image models face a trade-off: diffusion-based continuous token modeling incurs high computational cost, while vector-quantized discrete tokens introduce reconstruction artifacts. Method: This paper proposes a unified autoregressive framework that jointly models continuous image tokens and discrete text tokens—eliminating quantization entirely. It replaces the conventional decoder with a lightweight flow-matching head (157M parameters) and employs a 14B-parameter backbone, trained end-to-end via pure next-token prediction. Contribution/Results: By bypassing quantization, the approach avoids reconstruction distortion, enables high-resolution, high-fidelity image generation, and natively supports fine-grained image editing. Experiments demonstrate state-of-the-art performance across multiple autoregressive image generation benchmarks.
📝 Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens at the cost of quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models in text-to-image generation, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method performs well in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
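To make the training recipe concrete, here is a minimal, hypothetical sketch of the flow-matching objective a head like this could optimize for one continuous image token. Everything below (sizes, the single linear "head", the straight-line interpolation) is an illustrative assumption, not the paper's actual 157M-parameter architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only (the real head has 157M parameters).
HIDDEN, TOK = 8, 4   # backbone hidden size, continuous image-token dimension

# Hypothetical "flow-matching head": a single linear map from
# [backbone hidden state, noisy token, timestep] -> predicted velocity.
W = rng.normal(0, 0.1, size=(HIDDEN + TOK + 1, TOK))

def head(h, x_t, t):
    """Predict the velocity field v(x_t, t | h)."""
    inp = np.concatenate([h, x_t, [t]])
    return inp @ W

def flow_matching_loss(h, x1):
    """Flow-matching regression loss for one continuous image token.

    x0 ~ N(0, I) is noise and x1 is the target token; the head regresses
    the straight-line velocity x1 - x0 at a random time t along the
    interpolation x_t = (1 - t) * x0 + t * x1.
    """
    x0 = rng.standard_normal(TOK)
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = head(h, x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

h = rng.standard_normal(HIDDEN)   # hidden state from the AR backbone
x1 = rng.standard_normal(TOK)     # ground-truth continuous token
loss = flow_matching_loss(h, x1)
```

In this framing, the discrete text tokens would still use an ordinary cross-entropy next-token loss, while continuous image tokens contribute a per-position flow-matching loss like the one above, so the whole model trains end-to-end under a single next-token prediction scheme.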