🤖 AI Summary
Existing approximate nearest neighbor (ANN) graph indexes such as HNSW support only insertions and lack efficient, high-quality dynamic deletion mechanisms, leading to degraded recall, increased query latency, or prohibitively high deletion overhead. This paper introduces the first dynamic ANN indexing framework grounded in random walk theory, one that strictly preserves the original hitting-time statistics after deletions. We further propose a deterministic deletion algorithm that jointly optimizes query latency, recall, deletion time, and memory footprint through dynamic graph maintenance and multi-layer navigation refinement. Extensive experiments demonstrate that, compared to state-of-the-art deletion methods, our approach achieves up to 12% higher recall, 40% faster deletion, 22% lower query latency, and 18% lower memory consumption.
📝 Abstract
Approximate nearest neighbor (ANN) search is a common way to retrieve relevant search results, especially in the context of large language models and retrieval-augmented generation. One of the most widely used ANN algorithms constructs a multi-layer graph over the dataset, called the Hierarchical Navigable Small World (HNSW) graph. While this algorithm supports insertion of new data, it does not support deletion of existing data, and the deletion algorithms described in prior work come at the cost of increased query latency, decreased recall, or prolonged deletion time. In this paper, we propose a new theoretical framework for graph-based ANN based on random walks. Using this framework, we analyze a randomized deletion approach that preserves the hitting-time statistics of the graph from before the point was deleted. We then turn this randomized approach into a deterministic deletion algorithm and show, through an extensive collection of experiments, that it provides a better tradeoff between query latency, recall, deletion time, and memory usage.
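To make the central notion concrete: the hitting time from u to v is the expected number of steps a simple random walk starting at u takes to first reach v. The sketch below is purely illustrative and is not the paper's algorithm; the toy graph, the naive "reconnect the neighbors" repair, and the function names are all assumptions. It shows how a naive deletion changes hitting times, which is exactly the distortion the proposed framework is designed to avoid:

```python
# Illustrative sketch: exact hitting times on a small undirected graph,
# and how a naive "reconnect the neighbors" deletion changes them.
# NOT the paper's algorithm; graph and repair heuristic are assumptions.

def hitting_times(graph, target, iters=10000, tol=1e-12):
    """Expected steps for a simple random walk to first reach `target`.

    Solves h(target) = 0, h(v) = 1 + mean(h(u) for u in neighbors(v))
    by fixed-point iteration (sufficient for small connected graphs).
    """
    h = {v: 0.0 for v in graph}
    for _ in range(iters):
        delta = 0.0
        for v in graph:
            if v == target:
                continue
            new = 1.0 + sum(h[u] for u in graph[v]) / len(graph[v])
            delta = max(delta, abs(new - h[v]))
            h[v] = new
        if delta < tol:
            break
    return h

def delete_reconnect(graph, v):
    """Naive deletion: remove v and fully connect its former neighbors."""
    nbrs = graph.pop(v)
    for u in nbrs:
        graph[u].discard(v)
    for a in nbrs:
        for b in nbrs:
            if a != b:
                graph[a].add(b)

# Path graph 0 - 1 - 2: hitting time from 0 to 2 is 4 steps.
g = {0: {1}, 1: {0, 2}, 2: {1}}
before = hitting_times(g, target=2)[0]   # 4.0

delete_reconnect(g, 1)                   # now 0 - 2 directly
after = hitting_times(g, target=2)[0]    # 1.0
```

Here the naive repair shortens the expected walk from 4 steps to 1, so query behavior on the repaired graph no longer mirrors the original; a hitting-time-preserving deletion, as the abstract describes, would instead rewire the graph so these statistics match the graph from before the deletion.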