🤖 AI Summary
This paper addresses in-memory graph-based approximate nearest neighbor (ANN) search over high-dimensional vector collections of up to one billion vectors. We systematically evaluate twelve state-of-the-art algorithms on seven real-world datasets. For the first time, we conduct a unified benchmark of the mainstream graph construction paradigms (seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer). The results show that methods based on incremental insertion and neighborhood diversification typically offer the best accuracy-efficiency trade-offs, while the choice of base graph can hurt scalability. We identify data-adaptive seed selection and diversification strategies as key directions for future work. Experimental results indicate that graph construction methods combining incremental insertion with diversity constraints achieve the best balance among retrieval quality, throughput, and memory overhead. Our findings provide empirical evidence and design guidelines for engineering large-scale vector search systems.
📝 Abstract
Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, which increases the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. Although several paradigms (seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer) have been employed to design in-memory graph-based vector search algorithms, a systematic comparison of the key algorithmic advances is still missing. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data-adaptive seed selection and diversification strategies.
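To make two of the paradigms named above concrete, the following is a minimal sketch of incremental insertion combined with neighborhood diversification. It is not taken from any of the evaluated methods: the function names (`beam_search`, `diversify`, `build_graph`), the parameters (`max_degree`, `beam_width`), the fixed seed node 0, and the particular pruning rule (keep a candidate only if it is closer to the inserted point than to every neighbor already kept) are assumptions chosen for illustration.

```python
import heapq
import math
import random


def beam_search(data, graph, query, seed, beam_width):
    """Greedy best-first search; returns (distance, id) pairs, nearest first."""
    d0 = math.dist(data[seed], query)
    visited = {seed}
    candidates = [(d0, seed)]   # min-heap of nodes still to expand
    best = [(-d0, seed)]        # max-heap (negated distances) of current results
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop once the closest unexpanded node is worse than the worst result.
        if len(best) >= beam_width and d > -best[0][0]:
            break
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = math.dist(data[nb], query)
            if len(best) < beam_width or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > beam_width:
                    heapq.heappop(best)
    return sorted((-nd, i) for nd, i in best)


def diversify(data, candidates, max_degree):
    """Assumed pruning rule: keep a candidate only if it is closer to the
    base point than to every neighbor already kept."""
    kept = []
    for d, c in candidates:     # candidates must be sorted nearest first
        if all(d < math.dist(data[c], data[r]) for r in kept):
            kept.append(c)
        if len(kept) == max_degree:
            break
    return kept


def build_graph(data, max_degree=8, beam_width=32):
    """Insert points one at a time, searching the graph built so far."""
    graph = {0: []}             # node 0 doubles as the (fixed) search seed
    for i in range(1, len(data)):
        cands = beam_search(data, graph, data[i], seed=0, beam_width=beam_width)
        graph[i] = diversify(data, cands, max_degree)
        for nb in graph[i]:     # add reverse edges, re-pruning on overflow
            graph[nb].append(i)
            if len(graph[nb]) > max_degree:
                ranked = sorted((math.dist(data[nb], data[x]), x) for x in graph[nb])
                graph[nb] = diversify(data, ranked, max_degree)
    return graph


if __name__ == "__main__":
    random.seed(0)
    points = [(random.random(), random.random()) for _ in range(500)]
    g = build_graph(points)
    print(beam_search(points, g, (0.5, 0.5), seed=0, beam_width=10)[:3])
```

In this sketch the seed is always node 0, which is the simplest possible seed selection strategy; a data-adaptive strategy of the kind mentioned as an open direction would replace that fixed choice.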