🤖 AI Summary
Existing autoregressive text-to-image models face a trade-off: diffusion-based continuous token modeling incurs high computational cost, while vector-quantized discrete tokens introduce reconstruction artifacts. Method: This paper proposes a unified autoregressive framework that jointly models continuous image tokens and discrete text tokens—eliminating quantization entirely. It replaces the conventional decoder with a lightweight flow-matching head (157M parameters) and employs a 14B-parameter backbone, trained end-to-end via pure next-token prediction. Contribution/Results: By bypassing quantization, the approach avoids reconstruction distortion, enables high-resolution, high-fidelity image generation, and natively supports fine-grained image editing. Experiments demonstrate state-of-the-art performance across multiple autoregressive image generation benchmarks.
📝 Abstract
Prevailing autoregressive (AR) models for text-to-image generation either rely on heavy, computationally intensive diffusion models to process continuous image tokens, or employ vector quantization (VQ) to obtain discrete tokens at the cost of quantization loss. In this paper, we push the autoregressive paradigm forward with NextStep-1, a 14B autoregressive model paired with a 157M flow matching head, trained on discrete text tokens and continuous image tokens with a next-token prediction objective. NextStep-1 achieves state-of-the-art performance among autoregressive models in text-to-image generation, exhibiting strong capabilities in high-fidelity image synthesis. Furthermore, our method performs well in image editing, highlighting the power and versatility of our unified approach. To facilitate open research, we will release our code and models to the community.
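To make the training recipe concrete, here is a minimal, hypothetical sketch of the flow-matching objective a head like this could optimize for one continuous image token. Everything below (sizes, the single linear "head", the straight-line interpolation) is an illustrative assumption, not the paper's actual 157M-parameter architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions for illustration only (the real head has 157M parameters).
HIDDEN, TOK = 8, 4   # backbone hidden size, continuous image-token dimension

# Hypothetical "flow-matching head": a single linear map from
# [backbone hidden state, noisy token, timestep] -> predicted velocity.
W = rng.normal(0, 0.1, size=(HIDDEN + TOK + 1, TOK))

def head(h, x_t, t):
    """Predict the velocity field v(x_t, t | h)."""
    inp = np.concatenate([h, x_t, [t]])
    return inp @ W

def flow_matching_loss(h, x1):
    """Flow-matching regression loss for one continuous image token.

    x0 ~ N(0, I) is noise and x1 is the target token; the head regresses
    the straight-line velocity x1 - x0 at a random time t along the
    interpolation x_t = (1 - t) * x0 + t * x1.
    """
    x0 = rng.standard_normal(TOK)
    t = rng.uniform()
    x_t = (1 - t) * x0 + t * x1
    v_target = x1 - x0
    v_pred = head(h, x_t, t)
    return float(np.mean((v_pred - v_target) ** 2))

h = rng.standard_normal(HIDDEN)   # hidden state from the AR backbone
x1 = rng.standard_normal(TOK)     # ground-truth continuous token
loss = flow_matching_loss(h, x1)
```

In this framing, the discrete text tokens would still use an ordinary cross-entropy next-token loss, while continuous image tokens contribute a per-position flow-matching loss like the one above, so the whole model trains end-to-end under a single next-token prediction scheme.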