Vision Foundation Models as Effective Visual Tokenizers for Autoregressive Image Generation

📅 2025-07-11
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the low token efficiency and insufficient reconstruction fidelity of visual tokenizers in autoregressive image generation. The proposed framework, VFMTok, (i) employs a frozen, pretrained vision foundation model as a fixed encoder; (ii) introduces region-adaptive quantization to compress redundant features; and (iii) replaces conventional pixel-level reconstruction with a semantic reconstruction loss, enabling high-fidelity class-conditional generation without classifier-free guidance (CFG). This approach decouples tokenizer design from generative modeling, achieving both semantic consistency and computational efficiency. VFMTok significantly improves token efficiency and accelerates training convergence by 3×. On ImageNet, it achieves a gFID of 2.07, outperforming existing autoregressive methods in generation quality. The core contribution lies in redefining visual tokenization as a semantics-aware, lightweight pre-processing step, thereby enhancing both the fidelity and the scalability of autoregressive image synthesis.
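To make the summary concrete, here is a minimal NumPy sketch of the two ingredients it names: nearest-neighbour vector quantization of frozen encoder features against a codebook, and a reconstruction loss computed in the foundation model's feature space rather than in pixel space. All shapes, names, and the plain MSE objective are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen foundation-model patch features (e.g. a 16x16 grid,
# 64-dim per patch). In VFMTok the encoder producing these stays frozen.
features = rng.normal(size=(256, 64))
codebook = rng.normal(size=(512, 64))  # learnable codebook of 512 entries

def quantize(feats, book):
    """Nearest-neighbour vector quantization: map each feature to a token id."""
    dists = ((feats[:, None, :] - book[None, :, :]) ** 2).sum(axis=-1)  # (N, K)
    ids = dists.argmin(axis=1)
    return ids, book[ids]

ids, quantized = quantize(features, codebook)

# "Semantic" reconstruction objective: instead of matching raw pixels, the
# decoder output is trained to match the frozen features -- sketched here as
# a simple MSE in feature space.
semantic_loss = ((quantized - features) ** 2).mean()
print(ids.shape, float(semantic_loss))
```

The key design point the summary highlights is the target of the loss: because the reconstruction is scored in feature space, the tokenizer is pushed to preserve semantics rather than pixel detail.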


๐Ÿ“ Abstract
Leveraging the powerful representations of pre-trained vision foundation models -- traditionally used for visual comprehension -- we explore a novel direction: building an image tokenizer directly atop such models, a largely underexplored area. Specifically, we employ a frozen vision foundation model as the encoder of our tokenizer. To enhance its effectiveness, we introduce two key components: (1) a region-adaptive quantization framework that reduces redundancy in the pre-trained features on regular 2D grids, and (2) a semantic reconstruction objective that aligns the tokenizer's outputs with the foundation model's representations to preserve semantic fidelity. Based on these designs, our proposed image tokenizer, VFMTok, achieves substantial improvements in image reconstruction and generation quality, while also enhancing token efficiency. It further boosts autoregressive (AR) generation -- achieving a gFID of 2.07 on ImageNet benchmarks, while accelerating model convergence by three times, and enabling high-fidelity class-conditional synthesis without the need for classifier-free guidance (CFG). The code will be released publicly to benefit the community.
Problem

Research questions and friction points this paper is trying to address.

Develop an image tokenizer using vision foundation models
Enhance token efficiency and semantic fidelity
Improve autoregressive image generation quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using a frozen vision foundation model as the encoder
Introducing a region-adaptive quantization framework
Adding a semantic reconstruction objective for alignment
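The middle innovation, region-adaptive quantization, is described only as reducing redundancy in features laid out on a regular 2D grid. One way to picture that idea, purely as an assumption about the mechanism, is pooling many near-identical patch features (flat backgrounds) into a few region tokens while detailed regions keep more. The toy k-means below illustrates that compression; it is not the paper's actual method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 256-patch grid: two flat "background" regions plus a detailed region.
grid = np.concatenate([
    np.full((100, 8), 0.0) + rng.normal(scale=0.01, size=(100, 8)),  # sky-like
    np.full((100, 8), 5.0) + rng.normal(scale=0.01, size=(100, 8)),  # ground-like
    rng.normal(loc=2.5, scale=2.0, size=(56, 8)),                    # object
])

def region_pool(feats, k, iters=10):
    """Naive k-means: pool redundant grid patches into k region tokens."""
    centers = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        dists = ((feats[:, None] - centers[None]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = feats[assign == j].mean(axis=0)
    return centers, assign

regions, assign = region_pool(grid, k=16)
print(regions.shape)  # 256 redundant patches compressed to 16 region tokens
```

The redundant background patches collapse onto a handful of shared centers, which is the token-efficiency effect the bullet points at: fewer tokens spent where the features carry little new information.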