🤖 AI Summary
To serve a 50-billion-vector approximate nearest neighbor (ANN) index efficiently, this paper presents DISTRIBUTEDANN, a distributed vector search service that departs from the conventional partition-and-route paradigm. Rather than routing each query to a subset of independent partitions, it searches a single DiskANN graph index spread across more than a thousand machines. The design builds on two well-understood components: a distributed key-value store and an in-memory ANN index. The service achieves a median query latency of 26 ms and a throughput of over 100,000 queries per second, which the authors report as 6× more efficient than existing partitioning and routing strategies. DISTRIBUTEDANN has replaced the conventional scale-out architecture for serving the Bing search engine, and the paper shares the experience gained from that transition.
📝 Abstract
We present DISTRIBUTEDANN, a distributed vector search service that makes it possible to search over a single 50-billion-vector graph index spread across more than a thousand machines, with a median query latency of 26 ms and a throughput of over 100,000 queries per second. This is 6x more efficient than existing partitioning and routing strategies, which route the vector query to a subset of partitions in a scale-out vector search system. DISTRIBUTEDANN is built using two well-understood components: a distributed key-value store and an in-memory ANN index. DISTRIBUTEDANN has replaced conventional scale-out architectures for serving the Bing search engine, and we share our experience from making this transition.
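The abstract names the two building blocks but not how they interact. As a rough illustration only, the sketch below shows one way a graph search could run when each node's vector and neighbor list live in a key-value store sharded across machines. Everything in it is an assumption made for illustration rather than the paper's design: the `ShardedKV` class, the modulo placement of keys, the `greedy_search` function, and its `beam_width` parameter are hypothetical, and the real system's data layout and traversal strategy may differ.

```python
import heapq
import math


class ShardedKV:
    """Toy stand-in for a distributed key-value store (hypothetical).

    Each integer key is placed on one of `num_shards` in-process dicts.
    """

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def _shard(self, key):
        # Assumed placement rule: simple modulo on the node id.
        return self.shards[key % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key)[key]


def greedy_search(kv, entry, query, k, beam_width):
    """Best-first search over a graph whose nodes are stored as
    key -> (vector, neighbor_ids) records in the key-value store."""
    entry_vec, _ = kv.get(entry)
    start = math.dist(entry_vec, query)
    visited = {entry}
    frontier = [(start, entry)]   # min-heap: closest unexpanded node first
    best = [(-start, entry)]      # max-heap holding the beam_width closest nodes
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest unexpanded node is worse than the whole beam.
        if len(best) >= beam_width and d > -best[0][0]:
            break
        _, neighbors = kv.get(node)          # one KV lookup per expansion
        for nb in neighbors:
            if nb in visited:
                continue
            visited.add(nb)
            nb_vec, _ = kv.get(nb)           # fetch the neighbor's vector
            nd = math.dist(nb_vec, query)
            if len(best) < beam_width or nd < -best[0][0]:
                heapq.heappush(frontier, (nd, nb))
                heapq.heappush(best, (-nd, nb))
                if len(best) > beam_width:
                    heapq.heappop(best)      # drop the farthest node in the beam
    ranked = sorted((-neg_d, n) for neg_d, n in best)
    return [n for _, n in ranked[:k]]


# Usage: ten points on a line, each linked to its immediate neighbors.
kv = ShardedKV(num_shards=4)
for i in range(10):
    kv.put(i, ((float(i), 0.0), [j for j in (i - 1, i + 1) if 0 <= j < 10]))

top = greedy_search(kv, entry=0, query=(7.2, 0.0), k=2, beam_width=3)
print(top)
```

In this toy, every expansion costs key-value lookups that may land on different shards, which is the kind of cross-machine traffic a single distributed graph has to keep cheap in order to reach the latency the abstract reports.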