🤖 AI Summary
To address the high computational cost, low accuracy, and poor generalizability of distance computation in high-dimensional approximate k-nearest neighbor (AKNN) search, this paper proposes a data-distribution-aware orthogonal projection distance estimation method with a decoupled, data-driven correction scheme. It is the first to incorporate explicit data distribution modeling into orthogonal projection-based distance estimation, and it fully decouples distance approximation from correction, thereby jointly optimizing efficiency, accuracy, and generality. The approach integrates orthogonal projection for dimensionality reduction, a lightweight data-driven correction model, high-dimensional index optimization, and accelerated distance computation mechanisms. Extensive experiments on multiple real-world datasets demonstrate that the method achieves a 1.6–2.1× speedup over ADSampling while significantly improving recall and distance estimation accuracy.
📝 Abstract
Approximate K-Nearest Neighbor (AKNN) search in high-dimensional spaces is a critical yet challenging problem. In AKNN search, distance computation is the core task that dominates the runtime. Existing approaches typically use approximate distances to improve computational efficiency, often at the cost of reduced search accuracy. To address this issue, the state-of-the-art method, ADSampling, employs random projections to estimate approximate distances and introduces an additional distance correction process to mitigate accuracy loss. However, ADSampling has limitations in both effectiveness and generality, primarily because both its distance approximation and its correction rely on random projections. To improve effectiveness, we leverage the data distribution to speed up distance computation via orthogonal projection. To improve generality, we adopt a data-driven approach to distance correction, decoupling the correction process from the distance approximation process. Extensive experiments demonstrate the superiority and effectiveness of our method. In particular, compared to ADSampling, our method achieves a speedup of 1.6–2.1× on real-world datasets while providing higher accuracy.
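To make the two ideas in the abstract concrete, here is a minimal, illustrative sketch (not the paper's actual algorithm): a data-distribution-aware orthogonal projection (PCA-style axes from the data, so energy concentrates in the leading dimensions) used to estimate squared distances cheaply, followed by a decoupled, data-driven correction, here simplified to a single scalar factor `alpha` fitted on sampled pairs. All names (`approx_sqdist`, `corrected_sqdist`, `alpha`) and the toy dataset are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 1000 points in 64 dims with low intrinsic dimension,
# so a data-aware orthogonal projection concentrates most of the
# distance "energy" in the first few projected coordinates.
n, D, d = 1000, 64, 16
base = rng.normal(size=(n, 8)) @ rng.normal(size=(8, D))
X = base + 0.05 * rng.normal(size=(n, D))

# Data-distribution-aware orthogonal projection: principal axes of X.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
P = Vt[:d].T  # D x d orthonormal projection matrix

def approx_sqdist(q, x):
    """Cheap estimate: squared norm of the first d projected coordinates.
    Since P is orthonormal, this never exceeds the true squared distance."""
    delta = (q - x) @ P
    return float(delta @ delta)

def true_sqdist(q, x):
    delta = q - x
    return float(delta @ delta)

# Decoupled, data-driven correction: fit a scalar alpha by least squares
# on sampled pairs so that alpha * estimate tracks the true distance.
# (The paper's correction model is richer; a scalar keeps the sketch small.)
pairs = rng.integers(0, n, size=(2000, 2))
est = np.array([approx_sqdist(X[i], X[j]) for i, j in pairs])
true = np.array([true_sqdist(X[i], X[j]) for i, j in pairs])
alpha = float(true @ est / (est @ est))

def corrected_sqdist(q, x):
    """Corrected estimate used for candidate pruning in AKNN search."""
    return alpha * approx_sqdist(q, x)
```

Because the projection only drops coordinates, `approx_sqdist` is a lower bound on the true squared distance, which is what makes it safe for pruning candidates; the learned factor (`alpha >= 1` here) then removes the systematic underestimation, and since it is fitted separately from the projection, the same correction recipe applies regardless of how the approximation is produced.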