🤖 AI Summary
Vector quantization (VQ) in unsupervised learning often suffers from representation collapse: only a small part of the codebook is used and the latent space degenerates, which limits how far the codebook can scale. This paper proposes SimVQ, a VQ variant that reparameterizes the entire codebook through a learnable linear layer, so that training optimizes the whole linear space spanned by the codebook instead of only the single code vector selected by nearest-neighbor search. Motivated by a theoretical analysis of collapse, SimVQ integrates into standard VQ frameworks without requiring auxiliary regularization or dimensionality reduction. Evaluated on image and audio tasks with different model architectures, SimVQ adds only one linear transformation yet improves codebook utilization and downstream performance while mitigating collapse. The implementation is publicly available.
📝 Abstract
Vector Quantization (VQ) is a widely used method for converting continuous representations into discrete codes, and it has become fundamental in unsupervised representation learning and latent generative models. However, VQ models are often hindered by representation collapse in the latent space, which leads to low codebook utilization and limits the scalability of the codebook for large-scale training. Existing methods designed to mitigate representation collapse typically reduce the dimensionality of the latent space at the expense of model capacity, which does not fully resolve the core issue. In this study, we conduct a theoretical analysis of representation collapse in VQ models and identify its primary cause as the disjoint optimization of the codebook, where only a small subset of code vectors is updated through gradient descent. To address this issue, we propose SimVQ, a novel method that reparameterizes the code vectors through a linear transformation layer based on a learnable latent basis. This transformation optimizes the entire linear space spanned by the codebook, rather than merely updating the code vector selected by the nearest-neighbor search in vanilla VQ models. Although it is commonly understood that the product of two linear matrices is equivalent to a single linear layer, our approach works surprisingly well in resolving the collapse issue in VQ models with just one linear layer. We validate the efficacy of SimVQ through extensive experiments across various modalities, including image and audio data, with different model architectures. Our code is available at https://github.com/youngsheen/SimVQ.
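The contrast between vanilla VQ and the reparameterized codebook described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the names `basis`, `w`, `vanilla_step`, and `simvq_step` are hypothetical, the loss is a plain squared distance to the selected code, and the basis is held fixed while only the linear layer `w` is updated, which is a simplifying assumption made here.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def nearest(codes, z):
    """Index of the code vector closest to z in squared Euclidean distance."""
    dists = [sum((c - v) ** 2 for c, v in zip(code, z)) for code in codes]
    return dists.index(min(dists))


def vanilla_step(codes, z, lr=0.1):
    """Vanilla VQ: a gradient step on ||codes[k] - z||^2 touches only row k."""
    k = nearest(codes, z)
    new_codes = [row[:] for row in codes]
    new_codes[k] = [c - lr * 2 * (c - v) for c, v in zip(codes[k], z)]
    return k, new_codes


def simvq_step(basis, w, z, lr=0.1):
    """Effective codes are basis @ w; the gradient step updates w only.

    Assumption of this sketch: the basis stays fixed during the step.
    """
    codes = matmul(basis, w)
    k = nearest(codes, z)
    resid = [c - v for c, v in zip(codes[k], z)]
    # d/dw[i][j] of ||(basis @ w)[k] - z||^2 = 2 * basis[k][i] * resid[j]
    new_w = [[w[i][j] - lr * 2 * basis[k][i] * resid[j] for j in range(len(resid))]
             for i in range(len(w))]
    return k, matmul(basis, new_w), new_w


# Toy data (hypothetical): four 2-D codes, identity initialisation for w.
basis = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0], [-1.0, 0.5]]
identity = [[1.0, 0.0], [0.0, 1.0]]
z = [0.9, 0.4]

_, vanilla_codes = vanilla_step(basis, z)
_, simvq_codes, _ = simvq_step(basis, identity, z)

# Count how many effective code vectors changed after one update.
moved_vanilla = sum(old != new for old, new in zip(basis, vanilla_codes))
moved_simvq = sum(old != new for old, new in zip(basis, simvq_codes))
print(moved_vanilla, moved_simvq)
```

In this toy setup the vanilla step moves only the selected code, while the step on `w` shifts every effective code, because all of them share the same linear layer. A code whose basis vector is orthogonal to the selected one would not move in a single step, so the effect depends on the basis.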