Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation

πŸ“… 2025-02-27
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses two fundamental bottlenecks in autoregressive visual generation: the lack of an optimal 2D-structured tokenization scheme and exposure bias induced by teacher forcing. We propose xAR, a general-purpose framework with two core innovations: (1) the Next-X prediction paradigmβ€”a novel continuous regression framework over multi-granular visual entities (e.g., patches, cells, subsampled regions, scales, or whole images), replacing discrete token classification; and (2) a noise-context learning mechanism integrated with flow matching, eliminating reliance on teacher forcing and substantially mitigating exposure bias. Experiments demonstrate that xAR-B (172M parameters) achieves superior FID to DiT-XL/SiT-XL (675M) on ImageNet-256 while accelerating inference by 20Γ—. xAR-H sets a new state-of-the-art FID of 1.24, with 2.2Γ— speedup and no dependence on external modules (e.g., DINOv2).

πŸ“ Abstract
Autoregressive (AR) modeling, known for its next-token prediction paradigm, underpins state-of-the-art language and visual generative models. Traditionally, a "token" is treated as the smallest prediction unit, often a discrete symbol in language or a quantized patch in vision. However, the optimal token definition for 2D image structures remains an open question. Moreover, AR models suffer from exposure bias, where teacher forcing during training leads to error accumulation at inference. In this paper, we propose xAR, a generalized AR framework that extends the notion of a token to an entity X, which can represent an individual patch token, a cell (a $k \times k$ grouping of neighboring patches), a subsample (a non-local grouping of distant patches), a scale (coarse-to-fine resolution), or even a whole image. Additionally, we reformulate discrete token classification as \textbf{continuous entity regression}, leveraging flow-matching methods at each AR step. This approach conditions training on noisy entities instead of ground truth tokens, leading to Noisy Context Learning, which effectively alleviates exposure bias. As a result, xAR offers two key advantages: (1) it enables flexible prediction units that capture different contextual granularity and spatial structures, and (2) it mitigates exposure bias by avoiding reliance on teacher forcing. On the ImageNet-256 generation benchmark, our base model, xAR-B (172M), outperforms DiT-XL/SiT-XL (675M) while achieving 20$\times$ faster inference. Meanwhile, xAR-H sets a new state-of-the-art with an FID of 1.24, running 2.2$\times$ faster than the previous best-performing model without relying on vision foundation modules (e.g., DINOv2) or advanced guidance interval sampling.
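The continuous entity regression described above can be sketched as a standard flow-matching training step: interpolate between a Gaussian noise sample and the clean entity, then regress the velocity of that path. This is a minimal illustrative sketch under common flow-matching conventions (linear interpolation path, velocity target $x_1 - x_0$); the function name `flow_matching_step` and the toy zero model are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def flow_matching_step(x1, model, rng):
    """One flow-matching regression step for a single continuous entity.

    x1    : the clean entity (e.g. a patch/cell embedding), as an ndarray
    model : any callable (x_t, t) -> predicted velocity, same shape as x_t
    rng   : a numpy Generator for noise and time sampling
    """
    x0 = rng.standard_normal(x1.shape)      # noise endpoint of the path
    t = rng.uniform()                       # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1           # linear interpolation between noise and data
    v_target = x1 - x0                      # velocity of the linear path
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)  # continuous regression loss

# Toy usage: a 16-dim "entity" and a placeholder model that predicts zeros.
rng = np.random.default_rng(0)
patch = rng.standard_normal((16,))
zero_model = lambda x_t, t: np.zeros_like(x_t)
loss = flow_matching_step(patch, zero_model, rng)
```

In the paper's Noisy Context Learning, the conditioning context at training time would itself be noisy entities rather than ground-truth ones; the sketch above shows only the per-step regression target.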
Problem

Research questions and friction points this paper is trying to address.

Defining optimal token for 2D images
Mitigating exposure bias in AR models
Extending token to flexible entity X
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends token to entity X
Reformulates token classification
Mitigates exposure bias