AI Summary
This work investigates how latent-space structure affects generation quality and efficiency in latent diffusion models. We propose MAETok, a deterministic tokenizer based on masked autoencoding (MAE), which eliminates the variational constraints of traditional VAEs. Our key finding is that a low-dimensional, highly discriminative latent space matters more for generation than sophisticated prior modeling. MAETok achieves state-of-the-art performance on ImageNet (gFID = 1.69 for 512×512 images) using only 128 tokens. It trains 76× faster and delivers 31× higher inference throughput than VAE-based baselines. The method combines masked modeling, latent-space structural analysis, and end-to-end diffusion optimization to jointly optimize semantic expressiveness and reconstruction fidelity. All code and models are publicly released.
Abstract
Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that enable better learning and generation in diffusion models remain under-explored. Both theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian-mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) that leverages masked modeling to learn a semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary: a discriminative latent space from a plain AE alone enables state-of-the-art ImageNet generation using only 128 tokens. MAETok delivers significant practical improvements, achieving a gFID of 1.69 with 76× faster training and 31× higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.
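To make the masked-modeling idea concrete, the random patch-masking step at the heart of masked autoencoding can be sketched as follows. This is a toy NumPy illustration under assumed settings (256 patch tokens, a 75% mask ratio), not the released MAETok implementation; the function name and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_patches(patches, mask_ratio=0.75, rng=rng):
    """Randomly hide a fraction of patch tokens, as in masked autoencoding.

    Returns the visible patches and a boolean mask (True = hidden), so an
    encoder sees only the visible tokens while a decoder is trained to
    reconstruct the hidden ones.
    """
    n = patches.shape[0]
    n_hidden = int(round(n * mask_ratio))
    order = rng.permutation(n)        # random ordering of token indices
    mask = np.zeros(n, dtype=bool)
    mask[order[:n_hidden]] = True     # first n_hidden indices are hidden
    return patches[~mask], mask

# 256 patch tokens of dimension 16 (e.g. a 16x16 grid of image patches)
patches = rng.standard_normal((256, 16))
visible, mask = mask_patches(patches, mask_ratio=0.75)
print(visible.shape, int(mask.sum()))  # (64, 16) 192
```

Training the tokenizer to reconstruct the hidden tokens from the visible ones is what pushes the latent space toward the semantically rich, discriminative structure the paper argues is key for diffusion.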