SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

📅 2024-12-14
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
To address the inefficiency and severe semantic degradation of image tokenization under high compression ratios, this paper proposes SoftVQ-VAE (Soft Vector Quantization Variational Autoencoder), which enables continuous, differentiable, high-capacity latent-space modeling. Its core innovation is a soft categorical posterior aggregation mechanism: multiple codebook entries are adaptively weighted and combined into a single 1-dimensional continuous token, compressing 256×256 and 512×512 images into only 32 and 64 tokens, respectively, while balancing reconstruction fidelity against semantic richness for generation. The method is fully end-to-end differentiable and natively compatible with Transformer architectures. Experiments demonstrate 18× and 55× speedups in image-generation inference throughput (at 256² and 512² resolution) with FID scores of 1.78 and 2.21, respectively, while reducing the number of training iterations by 2.3×. Code and pretrained models are publicly released.

📝 Abstract
Efficient image tokenization with high compression ratios remains a critical challenge for training generative models. We present SoftVQ-VAE, a continuous image tokenizer that leverages soft categorical posteriors to aggregate multiple codewords into each latent token, substantially increasing the representation capacity of the latent space. When applied to Transformer-based architectures, our approach compresses 256x256 and 512x512 images using as few as 32 or 64 1-dimensional tokens. Not only does SoftVQ-VAE show consistent and high-quality reconstruction, more importantly, it also achieves state-of-the-art and significantly faster image generation results across different denoising-based generative models. Remarkably, SoftVQ-VAE improves inference throughput by up to 18x for generating 256x256 images and 55x for 512x512 images while achieving competitive FID scores of 1.78 and 2.21 for SiT-XL. It also improves the training efficiency of the generative models by reducing the number of training iterations by 2.3x while maintaining comparable performance. With its fully-differentiable design and semantic-rich latent space, our experiment demonstrates that SoftVQ-VAE achieves efficient tokenization without compromising generation quality, paving the way for more efficient generative models. Code and model are released.
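The soft categorical posterior described in the abstract replaces hard nearest-codeword assignment with a differentiable weighted aggregation over all codewords. The following is a minimal NumPy sketch of that idea under stated assumptions, not the authors' released implementation; the function name `soft_quantize` and the `temperature` parameter are illustrative choices.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_quantize(z, codebook, temperature=1.0):
    """Soft vector quantization (illustrative sketch).

    Instead of snapping each encoder output z to its single nearest
    codeword (hard VQ, non-differentiable), compute a softmax over
    negative distances to ALL codewords and return the weighted
    average: a continuous, fully differentiable token.

    z:        (num_tokens, dim)  encoder outputs
    codebook: (num_codes, dim)   learnable codewords
    """
    # squared Euclidean distance from each token to each codeword
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    weights = softmax(-d2 / temperature, axis=-1)               # (T, K)
    return weights @ codebook                                   # (T, dim)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 8))   # 16 codewords of dimension 8
z = rng.normal(size=(32, 8))          # 32 one-dimensional tokens
tokens = soft_quantize(z, codebook)
print(tokens.shape)  # (32, 8)
```

As the temperature approaches zero the softmax weights collapse to one-hot vectors and the scheme reduces to ordinary hard VQ; larger temperatures blend more codewords into each token, which is what raises the representation capacity of the latent space.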
Problem

Research questions and friction points this paper is trying to address.

Efficient image tokenization with high compression ratios.
Improving inference throughput and training efficiency for generative models.
Achieving state-of-the-art image generation quality and speed.
Innovation

Methods, ideas, or system contributions that make the work stand out.

SoftVQ-VAE aggregates multiple codewords into each latent token via soft categorical posteriors.
Compresses 256×256 and 512×512 images into as few as 32 or 64 1-dimensional tokens.
Improves inference throughput by up to 55× and cuts training iterations by 2.3×.