Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images

📅 2026-04-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Permutation-Invariant Vector Quantization (PI-VQ), a discrete representation framework that removes the entanglement between codebook entries and spatial positions inherent in conventional approaches such as VQ-VAE. By enforcing permutation invariance on the latent codes, PI-VQ learns global semantic features independent of location, enabling direct interpolation-based image generation without a trained prior model. The method introduces matching quantization, a quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by 3.5× and allows synthesis of novel images in a single forward pass. Experiments on CelebA, CelebA-HQ, and FFHQ show that PI-VQ achieves competitive precision, density, and coverage, supporting the viability of position-agnostic discrete representations for generative modeling.
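
The interpolation-based sampling mentioned above can be pictured with a short sketch. The following is a minimal illustration, not the paper's implementation: `encoder` and `decoder` are hypothetical stand-ins, and optimal matching (via `scipy.optimize.linear_sum_assignment`) is assumed here as one plausible way to pair two unordered code sets before blending them and decoding once.

```python
import torch
from scipy.optimize import linear_sum_assignment

def interpolate_sample(x_a, x_b, encoder, decoder, alpha=0.5):
    """Blend the unordered code sets of two images; decode in one pass."""
    z_a = encoder(x_a)                                   # (n, d) unordered codes
    z_b = encoder(x_b)                                   # (n, d) unordered codes
    cost = torch.cdist(z_a, z_b).detach().cpu().numpy()  # (n, n) pairwise distances
    _, cols = linear_sum_assignment(cost)                # optimal pairing of code sets
    z_b = z_b[torch.as_tensor(cols, device=z_b.device)]  # align b's codes with a's
    z_mix = (1 - alpha) * z_a + alpha * z_b              # convex blend of matched pairs
    return decoder(z_mix)                                # single forward pass
```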
📝 Abstract
Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
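
To make matching quantization concrete, here is a minimal sketch under stated assumptions: each of the `n` latent vectors is assigned to a distinct entry of a codebook with `K >= n` entries by solving an optimal bipartite matching, in contrast to independent nearest-neighbour lookup. The Euclidean cost, the straight-through gradient trick, and all names below are illustrative, not the paper's exact formulation.

```python
import torch
from scipy.optimize import linear_sum_assignment

def matching_quantize(z, codebook):
    """z: (n, d) latents; codebook: (K, d) with K >= n.

    Returns quantized latents and the chosen (distinct) codebook indices.
    """
    cost = torch.cdist(z, codebook).detach().cpu().numpy()  # (n, K) assignment costs
    _, cols = linear_sum_assignment(cost)                   # each latent -> unique entry
    idx = torch.as_tensor(cols, device=z.device)
    z_q = codebook[idx]                                     # (n, d) quantized vectors
    z_q = z + (z_q - z).detach()                            # straight-through estimator
    return z_q, idx
```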
Problem

Research questions and friction points this paper is trying to address.

permutation-invariant
discrete representation learning
vector quantization
spatially aligned images
position-free representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

permutation-invariant representation
vector quantization
matching quantization
discrete representation learning
position-free latent codes
Jamie S. J. Stirling
Durham University, United Kingdom
Noura Al-Moubayed
Durham University, United Kingdom
Hubert P. H. Shum
Professor of Visual Computing, Director of Research in Computer Science, Durham University
Responsible AI · Computer Vision · Computer Graphics · AI in Healthcare