🤖 AI Summary
Existing vector search algorithms treat the metric structure of embeddings as a constraint while neglecting their intrinsic geometric properties, leading to degraded performance in high-dimensional approximate nearest neighbor (ANN) retrieval. To address this, we propose the *q-metric space projection framework*, which maps vectors under arbitrary distance functions into a q-metric space satisfying a strong triangle inequality, thereby preserving exact nearest-neighbor relationships while significantly enhancing distance discriminability. We theoretically prove that as $q \to \infty$, the search complexity becomes logarithmic. Furthermore, we design a differentiable projection network for end-to-end learning and reformulate classical metric trees (including VP-trees and cover trees) within this framework. Empirical evaluation on text and image retrieval benchmarks demonstrates substantial improvements over conventional metric-tree baselines, matching state-of-the-art non-metric methods while achieving up to 3.2× lower query latency.
📝 Abstract
Despite the ubiquity of vector search applications, prevailing search algorithms overlook the metric structure of vector embeddings, treating it as a constraint rather than exploiting its underlying properties. In this paper, we demonstrate that in $q$-metric spaces, metric trees can leverage a stronger version of the triangle inequality to reduce the number of comparisons needed for exact search. Notably, as $q$ approaches infinity, the search complexity becomes logarithmic. Therefore, we propose a novel projection method that embeds vector datasets with arbitrary dissimilarity measures into $q$-metric spaces while preserving the nearest neighbor. We propose to learn an approximation of this projection to efficiently transform query points to a space where Euclidean distances satisfy the desired properties. Our experimental results with text and image vector embeddings show that learning $q$-metric approximations enables classic metric tree algorithms, which typically underperform with high-dimensional data, to achieve competitive performance against state-of-the-art search methods.
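To make the "stronger triangle inequality" concrete, the sketch below works under two assumptions that the abstract itself does not spell out: that a $q$-metric satisfies $d(x,z)^q \le d(x,y)^q + d(y,z)^q$ (with $\max(d(x,y), d(y,z))$ as the bound when $q = \infty$), and that one possible projection replaces each dissimilarity with the smallest $q$-norm cost over all paths between the two points. All function names here are hypothetical and chosen only for illustration; this is not the authors' implementation, and it covers the exact projection on a small dataset, not the learned approximation.

```python
import math
import random


def q_combine(a, b, q):
    """q-norm of two path segments; q = math.inf gives max(a, b)."""
    if math.isinf(q):
        return max(a, b)
    return (a ** q + b ** q) ** (1.0 / q)


def q_path_projection(D, q):
    """Smallest q-norm path cost between all pairs (Floyd-Warshall style).

    Assumed construction: the abstract does not state how the projection
    is computed.
    """
    n = len(D)
    P = [row[:] for row in D]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = q_combine(P[i][k], P[k][j], q)
                if via_k < P[i][j]:
                    P[i][j] = via_k
    return P


def satisfies_q_triangle(P, q, tol=1e-9):
    """Check d(x,z) <= q_combine(d(x,y), d(y,z)) for every triple."""
    n = len(P)
    return all(
        P[x][z] <= q_combine(P[x][y], P[y][z], q) + tol
        for x in range(n) for y in range(n) for z in range(n)
    )


def nearest_neighbor_preserved(D, P, tol=1e-9):
    """Check that each point's original nearest neighbor still attains
    the minimum projected distance (ties are allowed)."""
    n = len(D)
    for x in range(n):
        others = [z for z in range(n) if z != x]
        nn = min(others, key=lambda z: D[x][z])
        if P[x][nn] > min(P[x][z] for z in others) + tol:
            return False
    return True


def euclidean_distance_matrix(points):
    """Pairwise Euclidean distances, used here as the input dissimilarity."""
    return [[math.dist(a, b) for b in points] for a in points]


if __name__ == "__main__":
    rng = random.Random(0)
    points = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(8)]
    D = euclidean_distance_matrix(points)
    for q in (2, 4, math.inf):
        P = q_path_projection(D, q)
        print(q, satisfies_q_triangle(P, q), nearest_neighbor_preserved(D, P))
```

The cubic-time loop is only practical for small datasets, which is consistent with the abstract's motivation for learning an approximation of the projection for query points instead of computing it exactly.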