🤖 AI Summary
This work proposes Permutation-Invariant Vector Quantization (PI-VQ), a novel discrete representation framework that overcomes the entanglement between codebook entries and spatial positions inherent in conventional approaches like VQ-VAE. By enforcing permutation invariance in the latent codes, PI-VQ learns global semantic features independent of location, enabling direct interpolation-based image generation without requiring a trained prior model. The method introduces a matching quantization algorithm based on optimal bipartite graph matching, which increases effective bottleneck capacity by 3.5× and allows synthesis of novel images in a single forward pass. Experiments on CelebA, CelebA-HQ, and FFHQ demonstrate that PI-VQ achieves competitive precision, density, and coverage, validating the efficacy of position-agnostic discrete representations for generative modeling.
📝 Abstract
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
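The abstract does not spell out the matching quantization algorithm, but its core idea can be illustrated: instead of snapping each latent vector independently to its nearest codebook entry (which lets many latents collapse onto the same code), all latents are jointly assigned to *distinct* codes via an optimal bipartite matching that minimises total distance. The sketch below is illustrative only and uses a brute-force search over permutations; the function names and list-of-vectors representation are assumptions, and a real implementation would use the Hungarian algorithm.

```python
from itertools import permutations

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors (plain lists)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_quantize(latents, codebook):
    """Naive VQ: each latent independently snaps to its closest code,
    so several latents may collapse onto the same codebook entry."""
    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]

def matching_quantize(latents, codebook):
    """Matching-quantization sketch: jointly assign latents to *distinct*
    codebook entries by minimising the total distance over all bipartite
    matchings. Brute force here; use the Hungarian algorithm in practice."""
    best = min(permutations(range(len(codebook)), len(latents)),
               key=lambda p: sum(sq_dist(z, codebook[k])
                                 for z, k in zip(latents, p)))
    return list(best)

# Two latents near code 0 collapse under naive VQ but stay distinct
# under matching quantization.
latents = [[0.0], [0.1], [0.9]]
codebook = [[0.0], [0.5], [1.0]]
print(nearest_quantize(latents, codebook))   # → [0, 0, 2]
print(matching_quantize(latents, codebook))  # → [0, 1, 2]
```

Because every latent slot receives a distinct code, a permutation-invariant bottleneck of $N$ slots over $K$ codes indexes combinations rather than repeated multisets, which is one way such a matching step can raise effective capacity.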