AI Summary
This work addresses the low token efficiency and insufficient reconstruction fidelity of visual tokenizers in autoregressive image generation. We propose VFMTok, a framework that (i) employs a frozen, pretrained vision foundation model as the tokenizer's encoder; (ii) introduces region-adaptive quantization to reduce redundancy in the pretrained features; and (iii) adds a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations. Our approach decouples tokenizer design from generative modeling, achieving both semantic consistency and computational efficiency. VFMTok improves token efficiency and accelerates model convergence by 3×. On ImageNet, it achieves a gFID of 2.07, outperforming existing autoregressive methods in generation quality, and it enables high-fidelity class-conditional generation without classifier-free guidance (CFG). The core contribution lies in redefining visual tokenization as a semantics-aware, lightweight pre-processing step, thereby enhancing both the fidelity and the scalability of autoregressive image synthesis.
Abstract
Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
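To make the two components named in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes plain nearest-neighbor vector quantization and a cosine-similarity form for the semantic reconstruction objective; the abstract only says the outputs are "aligned" with the foundation model's representations, so the loss form is an assumption. The region-adaptive sampling step is not modeled, and the names `quantize`, `semantic_reconstruction_loss`, `frozen_features`, and `codebook` are hypothetical.

```python
import math

def quantize(feature, codebook):
    """Return (index, code) of the codebook entry nearest to `feature` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))
    return index, codebook[index]

def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_reconstruction_loss(predicted, target):
    """Mean of (1 - cosine similarity) over paired feature vectors (assumed loss form)."""
    total = sum(1.0 - cosine_similarity(p, t) for p, t in zip(predicted, target))
    return total / len(predicted)

# Toy stand-ins: features a frozen encoder might emit, and a two-entry codebook.
frozen_features = [[1.0, 0.0], [0.0, 2.0], [0.9, 0.1]]
codebook = [[1.0, 0.0], [0.0, 1.0]]

# Discrete token ids: one codebook index per feature vector.
tokens = [quantize(f, codebook)[0] for f in frozen_features]

# A real tokenizer would decode the tokens with a learned decoder; here the
# codebook lookup itself stands in for the reconstructed features.
reconstructed = [codebook[t] for t in tokens]
loss = semantic_reconstruction_loss(reconstructed, frozen_features)
```

In this sketch the encoder features are treated as fixed targets, which mirrors the frozen-encoder design: only the codebook and decoder would receive gradients in a trained system.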