Approximate Nearest Neighbor Search of Large Scale Vectors on Distributed Storage

📅 2025-10-20
🤖 AI Summary
Existing approximate nearest neighbor search (ANNS) methods for high-dimensional, large-scale vectors rely on single-machine storage and therefore suffer from poor scalability, high infrastructure costs, and single points of failure. To address this, we propose DSANN, a distributed ANNS system designed for scalable, fault-tolerant vector retrieval. Its core contributions are: (1) a hybrid graph-cluster indexing structure that combines the high connectivity of graphs with the locality-preserving properties of clustering; (2) a concurrent indexing algorithm that reduces index construction complexity; (3) point-aggregation-based graph compression to minimize storage overhead and accelerate query processing; and (4) asynchronous I/O optimization to enhance distributed read performance. Extensive experiments on billion-scale, high-dimensional vector datasets demonstrate that DSANN significantly improves recall and throughput over state-of-the-art baselines, while ensuring high availability and near-linear scalability across distributed nodes.

📝 Abstract
Approximate Nearest Neighbor Search (ANNS) in high-dimensional space is an essential operator in many online services, such as information retrieval and recommendation. Indices constructed by state-of-the-art ANNS algorithms must be stored in a single machine's memory or on its disk to achieve high recall and throughput, and therefore suffer from substantial storage cost, limited scale, and a single point of failure. While distributed storage can provide a cost-effective and robust solution, there are no efficient and effective algorithms for indexing vectors in distributed storage scenarios. In this paper, we present DSANN, a new graph-cluster hybrid indexing and search system that supports Distributed Storage Approximate Nearest Neighbor Search. DSANN can efficiently index, store, and search billion-scale vector databases in distributed storage and guarantee the high availability of the index service. DSANN employs a concurrent index construction method to significantly reduce the complexity of index building. DSANN then applies a Point Aggregation Graph, which leverages the structural information of the graph to aggregate similar vectors, optimizing storage efficiency and improving query throughput via asynchronous I/O in distributed storage. Through extensive experiments, we demonstrate that DSANN can efficiently and effectively index, store, and search large-scale vector datasets in distributed storage scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addresses efficient ANNS on distributed storage for billion-scale vectors
Solves single machine limitations in storage capacity and availability
Optimizes distributed indexing through graph-cluster hybrid methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-cluster hybrid indexing for distributed storage
Concurrent index construction reducing complexity
Point Aggregation Graph that aggregates similar vectors to cut storage overhead, with asynchronous I/O improving query throughput
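
To make the cluster-plus-asynchronous-I/O idea above more concrete, here is a minimal sketch. It is not DSANN's algorithm: it covers only a cluster-style partition and concurrent block reads, and it omits the graph component and the Point Aggregation Graph entirely. The names RemoteBlockStore, build_index, search, and nprobe are hypothetical, and the "distributed storage" is simulated with an in-memory dictionary plus an artificial delay.

```python
import asyncio
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class RemoteBlockStore:
    """Hypothetical stand-in for distributed storage: one block per cluster."""

    def __init__(self, latency_s=0.01):
        self.blocks = {}
        self.latency_s = latency_s

    async def read(self, block_id):
        # Simulated network round trip; a real store would issue a remote read.
        await asyncio.sleep(self.latency_s)
        return self.blocks[block_id]


def build_index(vectors, centroids, store):
    """Assign each vector to its nearest centroid and store clusters as blocks."""
    for cid in range(len(centroids)):
        store.blocks[cid] = []
    for vid, vec in enumerate(vectors):
        cid = min(range(len(centroids)), key=lambda c: dist(vec, centroids[c]))
        store.blocks[cid].append((vid, vec))


async def search(query, centroids, store, k=3, nprobe=2):
    """Rank centroids in memory, then read the nprobe closest blocks concurrently."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    # The reads overlap, so total wait is roughly one round trip, not nprobe.
    blocks = await asyncio.gather(*(store.read(c) for c in order[:nprobe]))
    candidates = [(dist(query, vec), vid) for block in blocks for vid, vec in block]
    return [vid for _, vid in sorted(candidates)[:k]]


def demo():
    rng = random.Random(0)
    # Assumed toy data: three well-separated 2-D clusters of 20 points each.
    centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    vectors = [
        (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
        for cx, cy in centers
        for _ in range(20)
    ]
    store = RemoteBlockStore()
    build_index(vectors, centers, store)
    return asyncio.run(search((10.0, 10.0), centers, store, k=3, nprobe=2))
```

Calling demo() returns the ids of the three stored vectors closest to the query among the probed clusters. The point of the design is that only the small centroid table stays in memory, while the vector blocks live in remote storage and are read in parallel.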
Kun Yu
ECNU, Shanghai, China
Jiabao Jin
Ant Group
Vector DataBase
Xiaoyao Zhong
Ant Group, Shanghai, China
Peng Cheng
Tongji University, Shanghai, China
Lei Chen
HKUST (GZ), Guangzhou, China; HKUST, Hong Kong SAR, China
Zhitao Shen
Ant Group
database, data storage
Jingkuan Song
Tongji University, Shanghai, China
Hengtao Shen
Tongji University, Shanghai, China
Xuemin Lin
SJTU, Shanghai, China