Towards Efficient and Scalable Distributed Vector Search with RDMA

📅 2025-07-09
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
To address scalability limitations in large-scale vector retrieval caused by single-machine memory and bandwidth bottlenecks, this paper proposes an algorithm-system co-designed distributed approximate nearest neighbor (ANN) search framework. Methodologically, it introduces (1) a collaborative asynchronous search architecture that integrates clustering-aware data partitioning and task-pushing to minimize communication overhead; and (2) RDMA acceleration, communication batching, a custom storage format, and a lightweight distributed ANN algorithm. Evaluated on a 16-node cluster, the system achieves 9.8–13.4× higher throughput than a single machine and 2.12–3.58× improvement over the best baseline, while maintaining 95% recall@10. The framework thus simultaneously delivers high throughput, low latency, and high accuracy.

📝 Abstract
Similarity-based vector search facilitates many important applications such as search and recommendation but is limited by the memory capacity and bandwidth of a single machine due to large datasets and intensive data reads. In this paper, we present CoTra, a system that scales up vector search for distributed execution. We observe a tension between computation and communication efficiency, which is the main challenge for good scalability: handling the local vectors on each machine independently blows up computation because the pruning power of the vector index is not fully utilized, while running a global index over all machines introduces rich data dependencies and thus extensive communication. To resolve this tension, we leverage the fact that vector search is approximate in nature and robust to asynchronous execution. In particular, we run collaborative vector search over the machines with algorithm-system co-designs including clustering-based data partitioning to reduce communication, asynchronous execution to avoid communication stalls, and task push to reduce network traffic. To make collaborative search efficient, we introduce a suite of system optimizations including task scheduling, communication batching, and storage format. We evaluate CoTra on real datasets and compare with four baselines. The results show that when using 16 machines, the query throughput of CoTra scales to 9.8-13.4x over a single machine and is 2.12-3.58x of the best-performing baseline at 0.95 recall@10.
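The asynchronous execution described in the abstract, where each machine keeps searching its local vectors instead of stalling on remote replies, can be sketched roughly as an event loop over task queues. This is a minimal illustration, not CoTra's implementation; the names `worker`, `local_search`, `inbox`, and `outbox` are hypothetical.

```python
import queue
import threading

def worker(local_search, inbox, outbox, stop):
    """Asynchronous collaborative-search loop (illustrative sketch).

    The worker drains incoming search tasks and pushes follow-up tasks
    to the machines that own the needed vectors, so local computation
    overlaps with communication instead of blocking on it.
    """
    while not stop.is_set():
        try:
            task = inbox.get(timeout=0.01)
        except queue.Empty:
            continue  # no task yet; stay responsive without busy-waiting
        # Expand the task against local vectors only; any neighbors that
        # live on other machines come back as remote follow-up tasks.
        candidates, remote_tasks = local_search(task)
        for t in remote_tasks:
            outbox.put(t)  # "task push": ship the task, not the vectors
        inbox.task_done()
```

Because results are merged approximately, a task arriving slightly late only perturbs the candidate set rather than blocking the whole query, which is the robustness the paper exploits.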
Problem

Research questions and friction points this paper is trying to address.

Scalable distributed vector search with RDMA
Balancing computation and communication efficiency
Optimizing collaborative search with algorithm-system co-designs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clustering-based data partitioning reduces communication
Asynchronous execution avoids communication stall
Task push minimizes network traffic
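The first and third points above can be sketched together: partition vectors by k-means clustering so each machine holds spatially coherent data, then route (push) a query only to the machines whose cluster centroids are nearest. This is a hedged sketch of the general technique, not the paper's actual partitioner; `partition_by_clusters`, `route_query`, and all parameters are illustrative.

```python
import numpy as np

def partition_by_clusters(vectors, num_machines, iters=10, seed=0):
    """Clustering-based partitioning sketch: run k-means with one
    centroid per machine, so nearby vectors land on the same node."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), num_machines, replace=False)]
    assign = np.zeros(len(vectors), dtype=int)
    for _ in range(iters):
        # Assign each vector to its nearest centroid (its home machine).
        dists = np.linalg.norm(vectors[:, None] - centroids[None], axis=2)
        assign = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; skip empty clusters.
        for m in range(num_machines):
            members = vectors[assign == m]
            if len(members):
                centroids[m] = members.mean(axis=0)
    return assign, centroids

def route_query(query, centroids, top_m=2):
    """Task push sketch: send the query to the top_m machines with the
    closest centroids, instead of pulling their vectors over the network."""
    dists = np.linalg.norm(centroids - query, axis=1)
    return np.argsort(dists)[:top_m]
```

Routing a query to a few likely machines keeps most search traffic local; the abstract's communication batching and RDMA transport then amortize whatever cross-machine tasks remain.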
👥 Authors
Xiangyu Zhi · The Chinese University of Hong Kong
Meng Chen · Fudan University
Xiao Yan · Wuhan University
Baotong Lu · Microsoft Research (Database Systems, Machine Learning Systems)
Hui Li · The Chinese University of Hong Kong
Qianxi Zhang · MSRA (database)
Qi Chen · Microsoft Research
James Cheng · The Chinese University of Hong Kong