Approximate Nearest Neighbor Search of Large Scale Vectors on Distributed Storage

📅 2025-10-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approximate nearest neighbor search (ANNS) methods for high-dimensional, large-scale vectors rely on single-machine storage and therefore suffer from poor scalability, high infrastructure costs, and single points of failure. To address this, we propose DSANN, a distributed ANNS system designed for scalable, fault-tolerant vector retrieval. Its core contributions are: (1) a hybrid graph-cluster indexing structure that combines the high connectivity of graphs with the locality-preserving properties of clustering; (2) a concurrent indexing algorithm that reduces index construction complexity; (3) point-aggregation-based graph compression to minimize storage overhead and accelerate query processing; and (4) asynchronous I/O optimization to enhance distributed read performance. Extensive experiments on billion-scale, high-dimensional vector datasets demonstrate that DSANN significantly improves recall and throughput over state-of-the-art baselines, while ensuring high availability and near-linear scalability across distributed nodes.

📝 Abstract

Approximate Nearest Neighbor Search (ANNS) in high-dimensional space is an essential operator in many online services, such as information retrieval and recommendation. Indices constructed by state-of-the-art ANNS algorithms must be stored in a single machine's memory or disk to achieve high recall and throughput, and therefore suffer from substantial storage cost, limited scale, and a single point of failure. While distributed storage can provide a cost-effective and robust solution, there are no efficient and effective algorithms for indexing vectors in distributed storage scenarios. In this paper, we present DSANN (Distributed Storage Approximate Nearest Neighbor Search), a new graph-cluster hybrid indexing and search system. DSANN can efficiently index, store, and search billion-scale vector databases in distributed storage while guaranteeing high availability of the index service. DSANN employs a concurrent index construction method to significantly reduce the complexity of index building. It then applies a Point Aggregation Graph, which leverages the structural information of the graph to aggregate similar vectors, optimizing storage efficiency and improving query throughput via asynchronous I/O in distributed storage. Through extensive experiments, we demonstrate that DSANN can efficiently and effectively index, store, and search large-scale vector datasets in distributed storage scenarios.
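
To make the idea of cluster-based search over remote storage with asynchronous I/O more concrete, here is a minimal sketch. It is not DSANN's algorithm: it assumes a plain cluster index with centroids held in local memory, and it simulates distributed storage with an in-memory dictionary and a short sleep. All names (`REMOTE_CLUSTERS`, `CENTROIDS`, `fetch_cluster`, `search`, `nprobe`) are hypothetical and chosen only for illustration.

```python
import asyncio
import heapq
import math

# Hypothetical stand-in for distributed storage:
# cluster id -> list of (vector id, vector). A real system would read these remotely.
REMOTE_CLUSTERS = {
    0: [(0, (0.0, 0.0)), (1, (0.1, 0.2))],
    1: [(2, (5.0, 5.0)), (3, (5.2, 4.9))],
    2: [(4, (9.0, 0.0)), (5, (8.8, 0.3))],
}

# Assumption: centroids are small enough to keep in local memory.
CENTROIDS = {0: (0.05, 0.1), 1: (5.1, 4.95), 2: (8.9, 0.15)}


async def fetch_cluster(cluster_id):
    """Simulate one remote read; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return REMOTE_CLUSTERS[cluster_id]


async def search(query, k=2, nprobe=2):
    # Select the nprobe clusters whose centroids are closest to the query.
    probe = heapq.nsmallest(
        nprobe, CENTROIDS, key=lambda c: math.dist(query, CENTROIDS[c])
    )
    # Issue all reads at once so their latencies overlap instead of adding up.
    clusters = await asyncio.gather(*(fetch_cluster(c) for c in probe))
    # Score every fetched vector exactly and keep the k closest ids.
    candidates = [
        (math.dist(query, vec), vid) for cluster in clusters for vid, vec in cluster
    ]
    return [vid for _, vid in heapq.nsmallest(k, candidates)]


result = asyncio.run(search((5.0, 5.1)))
print(result)
```

With `nprobe` reads in flight concurrently, query latency in this toy setup is roughly that of one remote read rather than `nprobe` sequential reads, which is the general motivation for asynchronous I/O when the index lives on remote storage.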

Problem

Research questions and friction points this paper is trying to address.

Addresses efficient ANNS on distributed storage for billion-scale vectors

Removes single-machine limits on storage capacity and availability

Optimizes distributed indexing through graph-cluster hybrid methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-cluster hybrid indexing for distributed storage

Concurrent index construction reducing complexity

Point Aggregation Graph that aggregates similar vectors to reduce storage, paired with asynchronous I/O to improve query throughput

🔎 Similar Papers

A Parametrizable Algorithm for Distributed Approximate Similarity Search with Arbitrary Distances