🤖 AI Summary
Autoregressive (AR) image generation suffers from limited quality because single-pass causal decoding offers no mechanism for post-hoc refinement. To address this, the paper proposes *next-tensor prediction*, a paradigm that iteratively refines previously generated content by predicting overlapping discrete tensor blocks with a sliding window. A codebook-index-driven discrete noising scheme, combined with strictly causal masking during training, prevents information leakage. The approach is implemented as a plug-and-play module compatible with mainstream AR architectures (including LlamaGEN, Open-MAGVIT2, and RAR) operating atop VQ tokenizers with a learned token-to-tensor mapping. Quantitative evaluation shows substantial improvements: FID decreases by 12.3%, CLIP Score increases by 4.1, and human preference scores rise markedly. Generated images exhibit richer fine-grained detail and stronger structural consistency.
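The codebook-indexed noising idea can be sketched in a few lines: during training, a fraction of the input VQ token indices is resampled from the codebook so the model cannot simply copy ground-truth tokens from the overlap region. This is a minimal illustrative sketch; the function name, uniform resampling, and `noise_ratio` parameter are assumptions, not the paper's exact scheme.

```python
import random
from typing import List

def noise_tokens(tokens: List[int], codebook_size: int,
                 noise_ratio: float, rng: random.Random) -> List[int]:
    """Corrupt a fraction of discrete VQ token indices by resampling them
    uniformly from the codebook. Simplified stand-in for the paper's
    codebook-index-driven noising; the real scheme may sample differently."""
    return [rng.randrange(codebook_size) if rng.random() < noise_ratio else t
            for t in tokens]

rng = random.Random(0)
clean = [rng.randrange(1024) for _ in range(16)]      # 16 tokens, 1024-way codebook
noisy = noise_tokens(clean, codebook_size=1024, noise_ratio=0.3, rng=rng)
```

Because corrupted positions still hold valid codebook indices, the noised sequence stays in-distribution for the tokenizer while forcing the model to reconstruct rather than copy.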
📝 Abstract
Autoregressive (AR) image generators offer a language-model-friendly approach to image generation by predicting discrete image tokens in a causal sequence. However, unlike diffusion models, AR models lack a mechanism to refine previous predictions, limiting their generation quality. In this paper, we introduce TensorAR, a new AR paradigm that reformulates image generation from next-token prediction to next-tensor prediction. By generating overlapping windows of image patches (tensors) in a sliding fashion, TensorAR enables iterative refinement of previously generated content. To prevent information leakage during training, we propose a discrete tensor noising scheme, which perturbs input tokens via codebook-indexed noise. TensorAR is implemented as a plug-and-play module compatible with existing AR models. Extensive experiments on LlamaGEN, Open-MAGVIT2, and RAR demonstrate that TensorAR significantly improves the generation performance of autoregressive models.
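The sliding-window decoding loop described above can be sketched as follows. Each step predicts a `window`-sized block whose leading tokens overlap the previous block, so the overlap region overwrites (refines) earlier output. This is a hedged sketch under stated assumptions: the function names, the `stride` parameter, and the toy predictor are illustrative, not TensorAR's actual implementation.

```python
from typing import Callable, List

def next_tensor_decode(predict_block: Callable[[List[int]], List[int]],
                       prompt: List[int], seq_len: int,
                       window: int, stride: int) -> List[int]:
    """Sliding-window next-tensor decoding sketch: each step predicts a
    `window`-token block; its first `window - stride` tokens overlap and
    overwrite the tail of the current sequence, refining earlier content."""
    tokens = list(prompt)
    overlap = window - stride
    while len(tokens) < seq_len:
        start = max(len(prompt), len(tokens) - overlap)  # never rewrite the prompt
        block = predict_block(tokens[:start])            # condition on frozen prefix
        tokens[start:] = block[: seq_len - start]        # refine overlap + extend
    return tokens

# Toy predictor: emits the context length, just to trace the refinement pattern.
result = next_tensor_decode(lambda ctx: [len(ctx)] * 4,
                            prompt=[9, 9], seq_len=8, window=4, stride=2)
# result == [9, 9, 2, 2, 4, 4, 4, 4]: positions 4-5 were first written as 2s,
# then refined to 4s by the next overlapping block.
```

Setting `stride == window` recovers ordinary block-wise AR decoding with no refinement; smaller strides trade extra forward passes for more refinement of each token.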