Toward Efficient and Scalable Design of In-Memory Graph-Based Vector Search

📅 2025-09-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper studies in-memory graph-based approximate nearest neighbor (ANN) search over high-dimensional vector collections of up to one billion vectors. It experimentally evaluates twelve state-of-the-art methods on seven real datasets and compares the main design paradigms: seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer. The best-performing approaches are typically based on incremental insertion and neighborhood diversification, while the choice of the base graph can hurt scalability. The authors identify more sophisticated data-adaptive seed selection and diversification strategies as open research directions.

📝 Abstract

Vector data is prevalent across business and scientific applications, and its popularity is growing with the proliferation of learned embeddings. Vector data collections often reach billions of vectors with thousands of dimensions, thus increasing the complexity of their analysis. Vector search is the backbone of many critical analytical tasks, and graph-based methods have become the best choice for analytical tasks that do not require guarantees on the quality of the answers. Although several paradigms (seed selection, incremental insertion, neighborhood propagation, neighborhood diversification, and divide-and-conquer) have been employed to design in-memory graph-based vector search algorithms, a systematic comparison of the key algorithmic advances is still missing. We conduct an exhaustive experimental evaluation of twelve state-of-the-art methods on seven real data collections, with sizes up to 1 billion vectors. We share key insights about the strengths and limitations of these methods; e.g., the best approaches are typically based on incremental insertion and neighborhood diversification, and the choice of the base graph can hurt scalability. Finally, we discuss open research directions, such as the importance of devising more sophisticated data-adaptive seed selection and diversification strategies.

Problem

Research questions and friction points this paper is trying to address.

Optimizing in-memory graph vector search efficiency

Scaling algorithms for billion-scale vector datasets

Comparing paradigms for high-dimensional vector analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

In-memory graph-based vector search algorithms

Incremental insertion and neighborhood diversification methods

Data-adaptive seed selection and diversification strategies
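
To make two of the paradigms named above concrete, here is a minimal Python sketch of incremental insertion combined with neighborhood diversification. It is an illustration under stated assumptions, not the implementation of any method evaluated in the paper: the function names (`beam_search`, `diversify`, `build_graph`), the fixed seed node 0, the parameters `max_degree` and `beam_width`, and the relative-neighborhood-style pruning rule are all choices made for this example.

```python
import heapq
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def beam_search(graph, vectors, query, seed, beam_width):
    """Best-first search from `seed`; returns up to `beam_width`
    (distance, id) pairs sorted nearest first."""
    visited = {seed}
    d0 = dist(vectors[seed], query)
    candidates = [(d0, seed)]   # min-heap of nodes still to expand
    results = [(-d0, seed)]     # max-heap (negated distances) of best nodes so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(results) >= beam_width and d > -results[0][0]:
            break               # nearest unexpanded node cannot improve the results
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d_nb = dist(vectors[nb], query)
            if len(results) < beam_width or d_nb < -results[0][0]:
                heapq.heappush(candidates, (d_nb, nb))
                heapq.heappush(results, (-d_nb, nb))
                if len(results) > beam_width:
                    heapq.heappop(results)   # evict the current worst result
    return sorted((-neg_d, node) for neg_d, node in results)


def diversify(vectors, candidates, max_degree):
    """Assumed pruning rule: scan (distance_to_point, id) pairs nearest first and
    keep a candidate only if it is farther from every kept neighbor than from the point."""
    kept = []
    for d_c, c in sorted(candidates):
        if all(dist(vectors[c], vectors[k]) > d_c for k in kept):
            kept.append(c)
        if len(kept) == max_degree:
            break
    return kept


def build_graph(vectors, max_degree=8, beam_width=32):
    """Incremental insertion: search the graph built so far, then link the new node."""
    graph = {0: []}             # node 0 doubles as the fixed search seed (an assumption)
    for i in range(1, len(vectors)):
        found = beam_search(graph, vectors, vectors[i], seed=0, beam_width=beam_width)
        graph[i] = diversify(vectors, found, max_degree)
        for nb in graph[i]:
            graph[nb].append(i)                  # add the reverse edge
            if len(graph[nb]) > max_degree:      # re-prune an overfull neighbor list
                cands = [(dist(vectors[nb], vectors[x]), x) for x in graph[nb]]
                graph[nb] = diversify(vectors, cands, max_degree)
    return graph


# Small usage example on random data (illustrative only).
rng = random.Random(42)
data = [[rng.random() for _ in range(4)] for _ in range(100)]
index = build_graph(data, max_degree=6, beam_width=16)
nearest = beam_search(index, data, [0.5, 0.5, 0.5, 0.5], seed=0, beam_width=5)
```

Always starting from node 0 is the simplest possible seed selection strategy. A data-adaptive seed, which the paper highlights as an open direction, would replace that fixed choice. Note that re-pruning can drop reverse edges, so this sketch does not guarantee that every node stays reachable from the seed.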

🔎 Similar Papers

Accelerating Graph-based Vector Search via Delayed-Synchronization Traversal