Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

📅 2025-01-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the long-standing trade-off between reconstruction fidelity and computational efficiency in vision auto-encoders for image and video generation. We propose ViTok, a lightweight visual tokenizer for images and videos built on an enhanced Vision Transformer architecture. We systematically characterize the asymmetric impact of encoder and decoder scaling, finding that scaling the encoder yields minimal gains, while scaling the decoder improves reconstruction but has mixed benefits for generation. By jointly tuning bottleneck dimensionality and encoder-decoder architecture, ViTok achieves high-fidelity reconstruction at low FLOPs: it matches state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction and surpasses prior baselines on UCF-101 video reconstruction with 2-5x lower computational cost. Furthermore, when integrated into a Diffusion Transformer, ViTok enables state-of-the-art class-conditional video generation on UCF-101.

📝 Abstract
Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its objective of reconstruction and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation -- and find that while it is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but the benefits for generation are mixed. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
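The abstract describes tokenization as compressing pixels into a latent space whose bottleneck width drives reconstruction quality. As a hedged illustration only (this is not the paper's code: `patchify`, `d_bottleneck`, and the linear projections below are hypothetical stand-ins for ViTok's learned Transformer encoder and decoder), a minimal NumPy sketch of the patchify-and-bottleneck path:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flat patch tokens."""
    h, w, c = img.shape
    patches = img.reshape(h // p, p, w // p, p, c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

def unpatchify(patches, h, w, p, c):
    """Inverse of patchify: reassemble patch tokens into an image."""
    grid = patches.reshape(h // p, w // p, p, p, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
patches = patchify(img, p=8)            # 16 tokens of dim 192

d_bottleneck = 16                       # latent channels per token (illustrative)
enc = rng.standard_normal((192, d_bottleneck)) / np.sqrt(192)
latents = patches @ enc                 # compressed tokens: (16, 16)

# A linear pseudo-inverse stands in for the learned decoder here.
dec = np.linalg.pinv(enc)
recon = unpatchify(latents @ dec, 32, 32, 8, 3)
```

In this toy setup the 32x32x3 image (3072 values) is compressed to 16 tokens of 16 channels (256 values); the bottleneck width is the knob whose scaling the paper correlates with reconstruction and, more loosely, with generation quality.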
Problem

Research questions and friction points this paper is trying to address.

Autoencoder Design
Image and Video Generation
Visual Quality Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer
Enhanced Decoder
Efficient Image and Video Reconstruction
Philippe Hansen-Estruch
UT Austin; GenAI, Meta
David Yan
GenAI, Meta
Ching-Yao Chung
GenAI, Meta
Orr Zohar
Stanford University
Jialiang Wang
Research Scientist, Meta AI
Tingbo Hou
Google DeepMind
Tao Xu
GenAI, Meta
S. Vishwanath
UT Austin
Peter Vajda
GenAI, Meta
Xinlei Chen
FAIR, Meta