Diffusion Autoencoders are Scalable Image Tokenizers

📅 2025-01-30
🤖 AI Summary
Existing image tokenizers rely on supervised pretraining and the delicate balancing of multiple loss terms, which limits their scalability. To address these issues, this paper proposes DiTo, a fully self-supervised diffusion-based image tokenizer. DiTo demonstrates that a single diffusion L2 reconstruction objective suffices to train compact, scalable visual representations, removing the need for external labels, auxiliary losses, or pretrained models. The design decisions that enable this scaling are given theoretical grounding. Evaluated on image reconstruction and downstream generative tasks, including text-to-image synthesis, DiTo matches or surpasses state-of-the-art tokenizers while substantially simplifying training and generalizing better, including to larger images.
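To make the idea concrete, here is a minimal PyTorch sketch of a DiTo-style tokenizer: a convolutional encoder compresses the image into a compact latent, and a small conditional denoiser (standing in for the paper's diffusion decoder) reconstructs the image from that latent. The `TinyDenoiser` and `DiffusionTokenizer` classes, and all layer sizes, are illustrative placeholders and not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Placeholder conditional denoiser. It sees the noisy image, the tokenizer
    latent (upsampled to image resolution), and a timestep channel, and predicts
    an image-shaped target. A real DiTo-style decoder would be a U-Net or DiT."""

    def __init__(self, latent_channels: int, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + latent_channels + 1, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    def forward(self, noisy_image, latent, t):
        # Condition on the latent (broadcast over the image grid) and the timestep.
        cond = F.interpolate(latent, size=noisy_image.shape[-2:], mode="nearest")
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *noisy_image.shape[-2:])
        return self.net(torch.cat([noisy_image, cond, t_map], dim=1))


class DiffusionTokenizer(nn.Module):
    """DiTo-style tokenizer sketch: encoder -> compact latent grid; a diffusion
    decoder (the denoiser above) reconstructs the image conditioned on it."""

    def __init__(self, latent_channels: int = 4):
        super().__init__()
        # 8x spatial downsampling into a compact latent (sizes are illustrative).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = TinyDenoiser(latent_channels)

    def tokenize(self, image: torch.Tensor) -> torch.Tensor:
        return self.encoder(image)
```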

📝 Abstract
Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We show design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer which is supervised. DiTo achieves competitive or better quality than state-of-the-art in image reconstruction and downstream image generation tasks.
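Below is a hedged sketch of what training with only the diffusion L2 loss could look like in PyTorch. It assumes a conditional denoiser with a `(noisy_image, latent, t)` signature like the sketch above; the linear corruption path and velocity-style target are one common choice and may differ from the paper's exact diffusion formulation. The point is that the entire reconstruction signal comes from a single MSE term, with no GAN, perceptual, or KL losses to balance.

```python
import torch
import torch.nn.functional as F

def dito_training_step(encoder, denoiser, images, optimizer):
    """One optimization step using only the diffusion L2 loss (illustrative)."""
    z = encoder(images)                                   # compact latent tokens
    t = torch.rand(images.size(0), device=images.device)  # random diffusion time in [0, 1]
    noise = torch.randn_like(images)
    a = t.view(-1, 1, 1, 1)
    noisy = (1 - a) * images + a * noise                  # linear corruption path (one choice)
    target = noise - images                               # velocity-style regression target
    pred = denoiser(noisy, z, t)
    loss = F.mse_loss(pred, target)                       # the single training objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With the tokenizer sketch above, this could be called as `dito_training_step(model.tokenize, model.decoder, batch, optimizer)`, training encoder and decoder jointly end to end; at inference, the latent is the compact representation handed to a downstream generative model, while the diffusion decoder reconstructs pixels by iterative sampling.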
Problem

Research questions and friction points this paper is trying to address.

Image Processing
Quality Improvement
Unsupervised Learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

DiTo
Diffusion L2 Loss
Unsupervised Image Processing