Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of achieving efficient and high-fidelity tokenization for image generation. We propose FlowMo, a diffusion-based autoencoder built entirely upon the Transformer architecture—eliminating convolutions, adversarial losses, spatially aligned latent codes, and knowledge distillation. Our key contributions are threefold: (1) a novel two-stage training paradigm comprising mode-matching pretraining followed by mode-search fine-tuning; (2) flow matching to model the latent distribution, coupled with a perception-oriented reconstruction objective; and (3) native support for multiple compression ratios. On ImageNet-1K reconstruction, FlowMo establishes a new state-of-the-art for tokenization—achieved without distillation or adversarial training—significantly outperforming baselines including VQGAN and LDM. Both reconstruction fidelity and downstream generative quality are substantially improved.
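The flow-matching objective mentioned in the summary can be illustrated with a minimal sketch: sample a noise point, interpolate toward the data along a straight path, and regress a model's predicted velocity onto the path's constant target velocity. This is a generic flow-matching loss under common assumptions, not FlowMo's actual implementation; the names (`predict_velocity`, the toy linear map `W`) are illustrative placeholders, whereas the real model is a transformer.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate_path(x0, x1, t):
    """Straight-line path x_t = (1 - t) * x0 + t * x1.

    Its velocity along the path is the constant x1 - x0, which serves
    as the regression target in flow matching.
    """
    xt = (1.0 - t) * x0 + t * x1
    return xt, x1 - x0

def predict_velocity(xt, t, W):
    # Placeholder model: a linear map ignoring t. FlowMo itself uses a
    # transformer conditioned on the latent code; this is only a stand-in.
    return xt @ W

def flow_matching_loss(x1, W, rng):
    x0 = rng.standard_normal(x1.shape)       # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))   # one time per batch element
    xt, v_target = interpolate_path(x0, x1, t)
    v_pred = predict_velocity(xt, t, W)
    return np.mean((v_pred - v_target) ** 2)  # MSE on the velocity field

# Toy usage: a batch of 8 flattened 4-dimensional "images".
x1 = rng.standard_normal((8, 4))
W = np.zeros((4, 4))
loss = flow_matching_loss(x1, W, rng)
```

Minimizing this loss over the model's parameters (here just `W`) trains a velocity field whose ODE transports noise samples to the data distribution; the paper's mode-seeking fine-tuning stage then adjusts this objective toward perceptual quality.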

📝 Abstract
Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo.
Problem

Research questions and friction points this paper is trying to address.

How to achieve high-fidelity image tokenization without relying on convolutions or adversarial losses.
Why prior diffusion autoencoders fall short of state-of-the-art tokenizers, and whether a two-stage training approach can close the gap.
How to support state-of-the-art ImageNet-1K reconstruction at multiple compression rates within a single design.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based diffusion autoencoder for image tokenization
Mode-matching pre-training and mode-seeking post-training stages
Achieves state-of-the-art without convolutions or adversarial losses
Kyle Sargent
Stanford University
Kyle Hsu
Stanford University
artificial intelligence · machine learning · robotics
Justin Johnson
University of Michigan
Fei-Fei Li
Stanford University
Jiajun Wu
Stanford University