🤖 AI Summary
This work addresses the limitations of existing 1D visual tokenizers in autoregressive image generation, which adopt language modeling paradigms while neglecting the hierarchical structure and residual characteristics inherent in visual data, thereby constraining representational capacity. To overcome this, the authors propose ResTok, the first 1D visual tokenizer that integrates a hierarchical residual architecture. By enabling cross-level feature fusion and semantic residual modeling, ResTok achieves implicit hierarchical binding without explicit constraints and supports fully parallel layer-wise prediction to accelerate generation. Evaluated on ImageNet-256, ResTok attains a gFID of 2.34 with only nine sampling steps, significantly outperforming current methods and demonstrating superior modeling efficiency and generation quality.
📝 Abstract
Existing 1D visual tokenizers for autoregressive (AR) generation largely follow the design principles of language modeling, as they are built directly upon transformers whose priors originate in language, yielding single-hierarchy latent tokens and treating visual data as flat sequential token streams. However, this language-like formulation overlooks key properties of vision, particularly the hierarchical and residual network designs that have long been essential for convergence and efficiency in visual models. To bring "vision" back to vision, we propose the Residual Tokenizer (ResTok), a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. The hierarchical representations obtained through progressive merging enable cross-level feature fusion at each layer, substantially enhancing representational capacity. Meanwhile, the semantic residuals between hierarchies prevent information overlap, yielding more concentrated latent distributions that are easier for AR modeling. Cross-level bindings consequently emerge without any explicit constraints. To accelerate the generation process, we further introduce a hierarchical AR generator that substantially reduces sampling steps by predicting an entire level of latent tokens at once rather than generating them strictly token-by-token. Extensive experiments demonstrate that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps. Code is available at https://github.com/Kwai-Kolors/ResTok.
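The two mechanisms the abstract names, progressive merging into a hierarchy and semantic residuals between levels, can be illustrated with a toy sketch. This is not the authors' implementation (their tokenizer is a learned transformer); the pooling, upsampling, and array shapes below are illustrative assumptions chosen only to show why per-level residuals carry non-overlapping information.

```python
# Toy sketch (NOT the ResTok code) of hierarchical residual tokenization:
# (1) progressively merge 1D tokens into coarser levels, then
# (2) store each level as the residual w.r.t. the coarser reconstruction,
# so no level duplicates information already captured above it.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))  # 16 1D tokens of dim 8 (illustrative sizes)

# (1) Progressive merging: average adjacent token pairs to form coarser levels.
levels = [tokens]
while levels[-1].shape[0] > 2:
    cur = levels[-1]
    levels.append(0.5 * (cur[0::2] + cur[1::2]))
levels = levels[::-1]  # coarsest first: shapes (2, 4, 8, 16)

def upsample(x, n):
    # Nearest-neighbor repeat along the token axis (an assumed, simple choice).
    return np.repeat(x, n // x.shape[0], axis=0)

# (2) Semantic residuals: each level keeps only what coarser levels missed.
recon = np.zeros_like(levels[0])
residuals = []
for lvl in levels:
    recon = upsample(recon, lvl.shape[0])
    residuals.append(lvl - recon)   # the part a level-wise AR model would predict
    recon = recon + residuals[-1]   # running reconstruction

# Summing the upsampled residuals recovers the finest level exactly.
assert np.allclose(recon, tokens)
```

In this picture, a hierarchical AR generator would emit one whole level of residuals per step (4 steps here instead of 16 token-by-token steps), which is the source of the reduced sampling-step count reported in the abstract.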