🤖 AI Summary
This work addresses two key limitations of VAEs in image generation: the need for from-scratch training and their restricted semantic representational capacity. We propose a novel paradigm that repurposes a frozen, pretrained vision foundation encoder (e.g., a ViT) as a continuous tokenizer for diffusion models. Our method employs a three-stage strategy: (1) freezing the encoder while training a lightweight adapter and a decoder; (2) jointly optimizing all components under a semantic preservation loss that maintains latent-space structure; and (3) refining the decoder alone to enhance reconstruction fidelity. Crucially, this approach eliminates VAE training entirely, instead leveraging the high-level semantic priors embedded in foundation models to construct a more compact and semantically rich latent space. On ImageNet 256×256, our model achieves a gFID of 1.90 within just 64 training epochs. On LAION, a 2B-parameter instantiation significantly outperforms the VAE used in FLUX, demonstrating both efficiency and strong generalization.
📝 Abstract
In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.
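To make the three-stage alignment concrete, here is a minimal PyTorch sketch of its moving parts: a lightweight adapter over frozen encoder features, a cosine-based stand-in for the semantic preservation loss, and the stage-1 parameter selection (frozen encoder, trainable adapter and decoder). The module names, dimensions, and the choice of cosine distance are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight projection from encoder features to the diffusion latent."""
    def __init__(self, enc_dim: int = 768, lat_dim: int = 16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, enc_dim),
            nn.GELU(),
            nn.Linear(enc_dim, lat_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, tokens, enc_dim) -> (batch, tokens, lat_dim)
        return self.proj(feats)

def semantic_preservation_loss(tuned_feats: torch.Tensor,
                               frozen_feats: torch.Tensor) -> torch.Tensor:
    # Stage 2: keep the fine-tuned encoder's features close to the frozen
    # encoder's, so high-level semantics survive joint optimization.
    # (Cosine distance is one illustrative choice of similarity.)
    cos = nn.functional.cosine_similarity(tuned_feats, frozen_feats, dim=-1)
    return 1.0 - cos.mean()

def stage1_params(encoder: nn.Module, adapter: nn.Module,
                  decoder: nn.Module) -> list:
    # Stage 1: freeze the encoder; only adapter + decoder receive gradients.
    for p in encoder.parameters():
        p.requires_grad_(False)
    return list(adapter.parameters()) + list(decoder.parameters())
```

Stage 3 would then reuse the same pattern, passing only `decoder.parameters()` to the optimizer while the encoder and adapter stay fixed.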