Effective and General Distance Computation for Approximate Nearest Neighbor Search

📅 2024-04-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Distance computation in high-dimensional approximate k-nearest neighbor (AKNN) search is computationally expensive, and existing approximations lose accuracy and generalize poorly. This paper proposes a data-distribution-aware distance estimation method based on orthogonal projection, paired with a decoupled, data-driven correction scheme. According to the summary, it is the first to incorporate explicit data distribution modeling into orthogonal projection-based distance estimation, and it fully decouples distance approximation from distance correction, so that efficiency, accuracy, and generality can be improved together. The approach combines orthogonal projection for dimensionality reduction, a lightweight data-driven correction model, high-dimensional index optimization, and accelerated distance computation. Experiments on multiple real-world datasets show a 1.6–2.1× speedup over ADSampling, along with higher accuracy.

📝 Abstract
Approximate K Nearest Neighbor (AKNN) search in high-dimensional spaces is a critical yet challenging problem. In AKNN search, distance computation is the core task that dominates the runtime. Existing approaches typically use approximate distances to improve computational efficiency, often at the cost of reduced search accuracy. To address this issue, the state-of-the-art method, ADSampling, employs random projections to estimate approximate distances and introduces an additional distance correction process to mitigate accuracy loss. However, ADSampling has limitations in both effectiveness and generality, primarily due to its reliance on random projections for distance approximation and correction. To address the effectiveness limitations of ADSampling, we leverage data distribution to improve distance computation via orthogonal projection. Furthermore, to overcome the generality limitations of ADSampling, we adopt a data-driven approach to distance correction, decoupling the correction process from the distance approximation process. Extensive experiments demonstrate the superiority and effectiveness of our method. In particular, compared to ADSampling, our method achieves a speedup of 1.6 to 2.1 times on real-world datasets while providing higher accuracy.
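The general idea of estimating distances through a data-aware orthogonal projection can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the projection basis comes from the top principal directions of the data (approximated here by power iteration), and it replaces the paper's data-driven correction step with a simple exact lower-bound check. All function names (`top_directions`, `project`, `nearest_with_pruning`) are hypothetical. The property it relies on is that, for any orthonormal basis, the squared distance between two projected vectors never exceeds their exact squared distance, so a candidate whose projected distance already exceeds the current best can be skipped without computing the exact distance.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def top_directions(data, k, iters=100):
    """Approximate the top-k principal directions of `data`.

    Uses power iteration on the (implicit) covariance matrix and removes the
    components along directions already found, so the result is orthonormal.
    This is a stand-in for a proper PCA routine, chosen to avoid dependencies.
    """
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    rng = random.Random(0)
    basis = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            # Covariance-vector product computed as X^T (X v) / n.
            coeffs = [dot(row, v) for row in centered]
            w = [sum(c * row[j] for c, row in zip(coeffs, centered)) / n
                 for j in range(d)]
            # Orthogonalize against the directions found so far.
            for b in basis:
                c = dot(w, b)
                w = [wi - c * bi for wi, bi in zip(w, b)]
            norm = math.sqrt(dot(w, w))
            v = [wi / norm for wi in w]
        basis.append(v)
    return basis


def project(x, basis):
    """Coordinates of x in the subspace spanned by the orthonormal basis."""
    return [dot(x, b) for b in basis]


def nearest_with_pruning(query, data, basis):
    """Exact nearest neighbor, skipping candidates via the projected lower bound."""
    q_proj = project(query, basis)
    best_idx, best_dist, exact_calls = -1, math.inf, 0
    for i, x in enumerate(data):
        # In practice the projections of `data` would be precomputed once.
        lower_bound = sq_dist(project(x, basis), q_proj)
        if lower_bound >= best_dist:
            continue  # cannot beat the current best, skip the exact distance
        exact_calls += 1
        d = sq_dist(x, query)
        if d < best_dist:
            best_idx, best_dist = i, d
    return best_idx, best_dist, exact_calls


# Synthetic data whose variance is concentrated in a few dimensions.
rng = random.Random(42)
scales = [10.0, 5.0, 1.0, 0.5, 0.2, 0.1]
data = [[rng.gauss(0, s) for s in scales] for _ in range(200)]
query = [rng.gauss(0, s) for s in scales]

basis = top_directions(data, k=2)
idx, dist, exact_calls = nearest_with_pruning(query, data, basis)
print(f"nearest index: {idx}, exact distance computations: {exact_calls} of {len(data)}")
```

Because the lower bound is exact, this sketch returns the same neighbor as a brute-force scan and only saves work. The method described in the abstract goes further by accepting approximate distances and correcting them with a data-driven procedure, which this example does not attempt to reproduce.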
Problem

Research questions and friction points this paper is trying to address.

High-dimensional Space
Approximate Nearest Neighbor Search
Efficiency and Accuracy Balance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Improved AKNN
Data-driven Strategy
Orthogonal Projection
Mingyu Yang
The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology
Wentao Li
The Hong Kong University of Science and Technology (Guangzhou), University of Leicester
Jiabao Jin
Ant Group
Vector DataBase
Xiaoyao Zhong
Ant Group
Xiangyu Wang
Professor, Curtin University
Civil Engineering, Building Information Modeling, Smart City, Automation and Robotics, Smart
Zhitao Shen
Ant Group
database, data storage
Wei Jia
Ant Group
Wei Wang
The Hong Kong University of Science and Technology (Guangzhou), The Hong Kong University of Science and Technology