Continuous First, Discrete Later: VQ-VAEs Without Dimensional Collapse

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses dimensional collapse in Vector Quantized Variational Autoencoders (VQ-VAEs), where vector quantization confines the latent representation to a low-dimensional subspace (1–2% of full rank), degrading both reconstruction fidelity and perceptual quality. The paper explains the phenomenon mechanistically by combining rate–distortion theory with the sequential learning dynamics of Saxe et al. [2014]: the quantizer suppresses lower-variance latent directions. To mitigate this without altering the model architecture, the authors propose AE Warm-Up: training the encoder–decoder as a continuous (unquantized) autoencoder before introducing quantization, which preserves high-dimensional latent structure. On image models (VQGAN), the approach raises codebook effective dimension from 3–5 to 17–19 and reduces rFID by 17–35%; on audio models (WavTokenizer), it raises codebook dimension from 4 to 17–19 and improves PESQ by 11–14%, in both cases at the same training budget.
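The warm-up idea can be illustrated with a minimal sketch of a bottleneck that passes latents through unchanged for the first steps and quantizes afterwards. The function names (`nearest_code`, `bottleneck`), the fixed `warmup_steps` threshold, and the plain nearest-neighbour lookup are assumptions made for illustration; they are not taken from the paper, which also describes an adaptive switching criterion not reproduced here.

```python
from typing import List, Sequence


def nearest_code(z: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to z (squared L2 distance)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((zi - ci) ** 2 for zi, ci in zip(z, code))

    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def bottleneck(
    z: Sequence[float],
    codebook: Sequence[Sequence[float]],
    step: int,
    warmup_steps: int,  # hypothetical fixed threshold, chosen by the user
) -> List[float]:
    """Continuous pass-through during warm-up, vector quantization afterwards."""
    if step < warmup_steps:
        # Warm-up phase: the model behaves as an ordinary autoencoder.
        return list(z)
    # VQ phase: replace the latent with its nearest codebook entry.
    return list(codebook[nearest_code(z, codebook)])
```

In a real training loop the same switch would sit between encoder and decoder, with the quantization-specific loss terms enabled only once the warm-up phase ends.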

📝 Abstract

While many approaches to improve VQ-VAE performance focus on codebook size and utilization, the effect of dimensional collapse, where trained VQ-VAE representations live in an extremely low-dimensional subspace (1-2% of full rank), remains unaddressed. We show theoretically and empirically that dimensional collapse causes a hard loss lower bound that various codebook improvement techniques fail to surpass. Our analytic framework extends the sequential learning effect of Saxe et al. [2014] by introducing ideas from rate-distortion theory and explains how the latent collapse is caused by the VQ suppressing lower-variance directions. Our theory justifies a simple solution: a "warm-up phase" that trains the model as an (unquantized) autoencoder before introducing VQ. On both synthetic experiments and large-scale image (VQGAN) and audio (WavTokenizer) VQ-VAEs, we show that AE Warm-Up successfully restores representation dimension, leading to lower reconstruction and perceptual loss at the same training budget. Across codebook sizes $K \in \{2^{10}, 2^{14}, 2^{16}\}$, AE Warm-Up raises VQGAN codebook effective dimension from 3-5 to 17-19 and reduces rFID by 17-35%; on WavTokenizer at $K \in \{2^{13}, 2^{14}\}$, it raises codebook dimension from 4 to 17-19 and improves PESQ by 11-14%. We empirically characterize how warm-up duration governs the achievable final loss. In agreement with experiment, our theoretical analysis predicts downstream performance as a function of warm-up length, enabling an adaptive criterion for switching from AE Warm-Up to VQ-VAE training.
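The abstract reports "effective dimension" without stating how it is computed. As a rough illustration only, the sketch below uses the participation ratio $(\operatorname{tr} C)^2 / \operatorname{tr}(C^2)$ of the latent covariance $C$, a common proxy for the number of directions that carry variance. Both this choice of measure and the function name `participation_ratio` are assumptions; the paper may define effective dimension differently.

```python
from typing import Sequence


def participation_ratio(latents: Sequence[Sequence[float]]) -> float:
    """Effective dimension (tr C)^2 / tr(C^2) of the covariance C of the latents.

    Assumed proxy, not necessarily the paper's definition. It equals 1 when all
    variance lies along one direction and d when variance is spread evenly
    over d directions.
    """
    n = len(latents)
    d = len(latents[0])
    # Center the data.
    mean = [sum(row[j] for row in latents) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in latents]
    # Covariance matrix C (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    trace = sum(cov[i][i] for i in range(d))
    # For symmetric C, tr(C^2) is the sum of squared entries.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    if trace_sq == 0.0:
        return 0.0  # degenerate case: all latents identical
    return trace * trace / trace_sq
```

Under this proxy, a collapsed representation that varies along a single line scores close to 1 regardless of the nominal latent width, while one with equal variance in every coordinate scores the full width.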

Problem

Research questions and friction points this paper is trying to address.

dimensional collapse

VQ-VAE

representation learning

codebook

latent space

Innovation

Methods, ideas, or system contributions that make the work stand out.

dimensional collapse

VQ-VAE

AE warm-up