🤖 AI Summary
In high-dimensional vector retrieval, conventional unsupervised dimensionality reduction methods (e.g., PCA, UMAP) often degrade retrieval accuracy because their optimization objectives—unrelated to retrieval—fail to preserve neighborhood structure. To address this, we propose MPAD, the first unsupervised dimensionality reduction method explicitly designed for retrieval: it maximizes the pairwise absolute distance difference between k-nearest neighbors and non-neighbors under a soft orthogonality constraint, thereby directly optimizing the discriminative boundary for nearest-neighbor identification—without labels or fine-tuning. MPAD is formulated as an end-to-end framework that integrates a distance-sensitive loss with a geometry-preserving mechanism. Extensive experiments across diverse benchmark datasets show that MPAD achieves 12–28% higher neighbor preservation rates than PCA and UMAP after dimensionality reduction, and its retrieval accuracy closely approaches that of search in the original high-dimensional space—effectively balancing precision and computational efficiency.
📝 Abstract
High-dimensional vector embeddings are widely used in retrieval systems, yet dimensionality reduction (DR) is seldom applied due to its tendency to distort the nearest-neighbor (NN) structure critical for search. Existing DR techniques such as PCA and UMAP optimize global or manifold-preserving criteria rather than retrieval-specific objectives. We present MPAD: Maximum Pairwise Absolute Difference, an unsupervised DR method that explicitly preserves approximate NN relations by maximizing the margin between k-NNs and non-k-NNs under a soft orthogonality constraint. This design enables MPAD to retain ANN-relevant geometry without supervision or changes to the original embedding model. Experiments across multiple domains show that MPAD consistently outperforms standard DR methods in preserving neighborhood structure, enabling more accurate search in reduced dimensions.
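To make the objective concrete, the following is a minimal sketch of an MPAD-style loss, not the paper's actual formulation: a linear projection `W` is scored by the margin between each point's reduced-space distances to its original-space k-NNs and to its non-neighbors, plus a soft orthogonality penalty on `W`. The function name `mpad_loss`, the exact margin form, and the penalty weight `lam` are all illustrative assumptions.

```python
import numpy as np

def mpad_loss(X, W, k=5, lam=0.1):
    """Hypothetical MPAD-style objective (sketch, not the paper's exact loss).

    X: (n, D) original embeddings; W: (D, d) linear projection.
    Rewards reduced-space configurations where each point's non-neighbors
    lie farther away than its k-NNs (k-NNs defined in the original space,
    since that is the structure to preserve), with a soft orthogonality
    penalty on W. Lower is better (we negate the margin to minimize).
    """
    n = X.shape[0]
    Z = X @ W  # reduced embeddings, shape (n, d)
    # Pairwise Euclidean distances in reduced and original spaces.
    D_red = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    D_orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    margin = 0.0
    for i in range(n):
        order = np.argsort(D_orig[i])
        nn = order[1 : k + 1]   # k nearest neighbors (skip self at index 0)
        far = order[k + 1 :]    # all non-neighbors
        # Absolute distance difference: non-neighbors minus neighbors.
        margin += D_red[i, far].mean() - D_red[i, nn].mean()
    # Soft orthogonality: penalize deviation of W^T W from identity.
    ortho = np.linalg.norm(W.T @ W - np.eye(W.shape[1])) ** 2
    return -(margin / n) + lam * ortho
```

In a full method this loss would be minimized over `W` by gradient descent; the sketch only evaluates it, which is enough to see how the neighbor/non-neighbor margin and the orthogonality term trade off.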