Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

📅 2025-07-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the low token efficiency and insufficient reconstruction fidelity of visual tokenizers in autoregressive image generation. The proposed framework, VFMTok, (i) employs a frozen, pretrained vision foundation model as a fixed encoder; (ii) introduces region-adaptive quantization to compress redundant features; and (iii) replaces conventional pixel-level reconstruction with a semantic reconstruction loss, enabling high-fidelity class-conditional generation without classifier-free guidance (CFG). This approach decouples tokenizer design from generative modeling, achieving both semantic consistency and computational efficiency. VFMTok significantly improves token efficiency and accelerates training convergence by 3×. On ImageNet, it achieves a gFID of 2.07, outperforming existing autoregressive methods in generation quality. The core contribution lies in redefining visual tokenization as a semantics-aware, lightweight pre-processing step, thereby enhancing both the fidelity and the scalability of autoregressive image synthesis.
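To make the summary concrete, here is a minimal NumPy sketch of the two ingredients it names: nearest-neighbour vector quantization of frozen encoder features against a codebook, and a reconstruction loss computed in the foundation model's feature space rather than in pixel space. All shapes, names, and the plain MSE objective are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen foundation-model patch features (e.g. a 16x16 grid,
# 64-dim per patch). In VFMTok the encoder producing these stays frozen.
features = rng.normal(size=(256, 64))
codebook = rng.normal(size=(512, 64))  # learnable codebook of 512 entries

def quantize(feats, book):
    """Nearest-neighbour vector quantization: map each feature to a token id."""
    dists = ((feats[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    ids = dists.argmin(axis=1)
    return ids, book[ids]

ids, quantized = quantize(features, codebook)

# "Semantic" reconstruction objective: instead of matching raw pixels, the
# decoder output is trained to match the frozen features -- sketched here as
# a simple MSE in feature space.
semantic_loss = ((quantized - features) ** 2).mean()
print(ids.shape, float(semantic_loss))
```

The key design point the summary highlights is the target of the loss: because the reconstruction is scored in feature space, the tokenizer is pushed to preserve semantics rather than pixel detail.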


๐Ÿ“ Abstract
Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
Problem

Research questions and friction points this paper is trying to address.

Develop an image tokenizer using vision foundation models
Enhance token efficiency and semantic fidelity
Improve autoregressive image generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using a frozen vision foundation model as the encoder
Introducing a region-adaptive quantization framework
Adding a semantic reconstruction objective for alignment
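The middle innovation, region-adaptive quantization, is described only as reducing redundancy in features laid out on a regular 2D grid. One way to picture that idea, purely as an assumption about the mechanism, is pooling many near-identical patch features (flat backgrounds) into a few region tokens while detailed regions keep more. The toy k-means below illustrates that compression; it is not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 256-patch grid: two flat "background" regions plus a detailed region.
grid = np.concatenate([
    np.full((100, 8), 0.0) + rng.normal(scale=0.01, size=(100, 8)),  # sky-like
    np.full((100, 8), 5.0) + rng.normal(scale=0.01, size=(100, 8)),  # ground-like
    rng.normal(loc=2.5, scale=2.0, size=(56, 8)),                    # object
])

def region_pool(feats, k, iters=10):
    """Naive k-means: pool redundant grid patches into k region tokens."""
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = ((feats[:, None] - centers[None]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(axis=0)
    return centers, assign

regions, assign = region_pool(grid, k=16)
print(regions.shape)  # 256 redundant patches compressed to 16 region tokens
```

The redundant background patches collapse onto a handful of shared centers, which is the token-efficiency effect the bullet points at: fewer tokens spent where the features carry little new information.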