AI Summary
Existing approximate nearest neighbor search (ANNS) methods for high-dimensional, large-scale vectors rely on single-machine storage, suffering from poor scalability, high infrastructure costs, and single-point failures. To address this, we propose DSANN, a distributed ANNS system designed for scalable, fault-tolerant vector retrieval. Its core contributions are: (1) a hybrid graph-cluster indexing structure that synergistically leverages the high connectivity of graphs and the locality-preserving properties of clustering; (2) a concurrent indexing algorithm that reduces index construction complexity; (3) point-aggregation-based graph compression to minimize storage overhead and accelerate query processing; and (4) asynchronous I/O optimization to enhance distributed read performance. Extensive experiments on billion-scale, high-dimensional vector datasets demonstrate that DSANN significantly improves recall and throughput over state-of-the-art baselines, while ensuring high availability and near-linear scalability across distributed nodes.
Abstract
Approximate Nearest Neighbor Search (ANNS) in high-dimensional space is an essential operator in many online services, such as information retrieval and recommendation. Indices constructed by state-of-the-art ANNS algorithms must be stored in a single machine's memory or disk to achieve high recall and throughput, and therefore suffer from substantial storage cost, limited scale, and a single point of failure. While distributed storage can provide a cost-effective and robust solution, there are no efficient and effective algorithms for indexing vectors in distributed storage scenarios. In this paper, we present a new graph-cluster hybrid indexing and search system that supports Distributed Storage Approximate Nearest Neighbor Search, called DSANN. DSANN can efficiently index, store, and search billion-scale vector databases in distributed storage and guarantee the high availability of the index service. DSANN employs a concurrent index construction method to significantly reduce the complexity of index building. DSANN then applies a Point Aggregation Graph, which leverages the structural information of the graph to aggregate similar vectors, optimizing storage efficiency and improving query throughput via asynchronous I/O in distributed storage. Through extensive experiments, we demonstrate that DSANN can efficiently and effectively index, store, and search large-scale vector datasets in distributed storage scenarios.
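To make the idea of combining cluster-level aggregation with asynchronous reads concrete, the following is a minimal sketch and not DSANN's actual implementation. It assumes vectors are grouped into clusters stored remotely, and that a query first picks the closest clusters and then fetches them concurrently. The names `STORAGE`, `CENTROIDS`, `fetch_cluster`, `search`, and `nprobe` are hypothetical, the remote store is simulated with an in-memory dictionary, and a brute-force scan over centroids stands in for the graph-based routing that the abstract describes.

```python
import asyncio
import math

# Hypothetical stand-in for distributed storage:
# cluster id -> list of (vector id, vector).
STORAGE = {
    0: [(0, (0.0, 0.0)), (1, (0.1, 0.2))],
    1: [(2, (5.0, 5.0)), (3, (5.2, 4.9))],
    2: [(4, (10.0, 0.0)), (5, (9.8, 0.3))],
}

# One representative point per cluster, assumed small enough to keep locally.
CENTROIDS = {0: (0.05, 0.1), 1: (5.1, 4.95), 2: (9.9, 0.15)}


async def fetch_cluster(cluster_id):
    """Simulate one remote read; the sleep models network latency."""
    await asyncio.sleep(0.01)
    return STORAGE[cluster_id]


async def search(query, k=2, nprobe=2):
    """Return the ids of the k closest vectors found in the nprobe nearest clusters."""
    # Assumption: a linear scan over centroids replaces graph traversal here.
    nearest = sorted(CENTROIDS, key=lambda c: math.dist(query, CENTROIDS[c]))[:nprobe]

    # Issue all cluster reads at once so their latencies overlap.
    clusters = await asyncio.gather(*(fetch_cluster(c) for c in nearest))

    # Rank every fetched vector by exact distance to the query.
    candidates = [
        (math.dist(query, vec), vid)
        for cluster in clusters
        for vid, vec in cluster
    ]
    return [vid for _, vid in sorted(candidates)[:k]]


result = asyncio.run(search((5.0, 5.1)))
print(result)
```

Because the reads are gathered rather than awaited one by one, the simulated query waits for roughly one round trip instead of `nprobe` round trips, which is the general reason asynchronous I/O can raise throughput when the index lives in remote storage.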