🤖 AI Summary
Existing visual autoregressive models rely on raster-scan ordering for "next-token prediction," neglecting the intrinsic spatial-temporal locality of visual data. This work proposes Neighboring Autoregressive Modeling (NAR), which reformulates generation as a progressive outpainting process ordered by increasing Manhattan distance from an initial token. It introduces a "next-neighbor prediction" mechanism with a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension, enabling parallel prediction of spatial-temporal neighborhood tokens and drastically reducing the number of generation steps. By breaking away from strictly sequential decoding, NAR achieves 2.4× and 8.6× throughput improvements on ImageNet and UCF101, respectively, while attaining superior FID/FVD scores. A compact 0.8B-parameter model surpasses Chameleon-7B on GenEval while using only 40% of its training data.
📝 Abstract
Visual autoregressive models typically adhere to a raster-order "next-token prediction" paradigm, which overlooks the spatial and temporal locality inherent in visual content. Specifically, visual tokens exhibit significantly stronger correlations with their spatially or temporally adjacent tokens than with distant ones. In this paper, we propose Neighboring Autoregressive Modeling (NAR), a novel paradigm that formulates autoregressive visual generation as a progressive outpainting procedure, following a near-to-far "next-neighbor prediction" mechanism. Starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance from the initial token in the spatial-temporal space, progressively expanding the boundary of the decoded region. To enable parallel prediction of multiple adjacent tokens in the spatial-temporal space, we introduce a set of dimension-oriented decoding heads, each predicting the next token along a mutually orthogonal dimension. During inference, all tokens adjacent to the decoded tokens are processed in parallel, substantially reducing the number of model forward steps required for generation. Experiments on ImageNet 256×256 and UCF101 demonstrate that NAR achieves 2.4× and 8.6× higher throughput, respectively, while obtaining superior FID/FVD scores for both image and video generation compared to the PAR-4X approach. On the text-to-image generation benchmark GenEval, NAR with 0.8B parameters outperforms Chameleon-7B while using merely 40% of the training data. Code is available at https://github.com/ThisisBillhe/NAR.
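The near-to-far decoding order described above can be illustrated with a small sketch. The snippet below (not from the paper's codebase; the function name and corner-seed choice are illustrative assumptions) groups the positions of a 2D token grid by their Manhattan distance from an initial token, since all positions at the same distance can be decoded in one parallel forward step:

```python
from collections import defaultdict

def nar_decode_schedule(height, width, seed=(0, 0)):
    """Group token positions by Manhattan distance from the seed token.

    In NAR, every position in the same distance group can be decoded in
    parallel, so the number of model forward steps drops from H*W under
    raster order to the number of distinct distances, which is
    H + W - 1 for a seed placed at a corner of the grid.
    """
    groups = defaultdict(list)
    for r in range(height):
        for c in range(width):
            d = abs(r - seed[0]) + abs(c - seed[1])
            groups[d].append((r, c))
    # Return groups in ascending distance order (near-to-far outpainting).
    return [groups[d] for d in sorted(groups)]

schedule = nar_decode_schedule(4, 4)
# A 4x4 grid needs 16 steps in raster order, but only 7 parallel steps here.
```

For a video, the same idea extends to 3D coordinates (frame, row, column), and the dimension-oriented heads each predict the neighbor along one of the orthogonal axes.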