🤖 AI Summary
Existing methods for billion-scale dynamic vector retrieval face three key challenges: memory constraints (e.g., HNSW), inefficient updates (e.g., DiskANN’s reliance on offline graph construction), and a fundamental trade-off between recall and throughput (e.g., SPFresh’s coarse-grained partitioning). This paper introduces the first disk-native storage system that tightly integrates hierarchical proximity graphs with an LSM-tree architecture, enabling out-of-place incremental updates without global reconstruction. We propose a novel sampling-based probabilistic search strategy and a connectivity-driven, block-level graph reordering mechanism, jointly optimizing I/O efficiency and recall. Evaluated on billion-scale datasets, our system achieves significantly higher recall than DiskANN and SPFresh, while reducing query and update latency and cutting memory footprint by over 66.2%.
📝 Abstract
Vector search underpins modern AI applications by supporting approximate nearest neighbor (ANN) queries over high-dimensional embeddings in tasks like retrieval-augmented generation (RAG), recommendation systems, and multimodal search. Traditional ANN indices (e.g., HNSW) are limited by memory constraints at large data scales. Disk-based indices such as DiskANN reduce memory overhead but rely on offline graph construction, resulting in costly and inefficient vector updates. The state-of-the-art clustering-based approach SPFresh offers better scalability but suffers from reduced recall due to coarse partitioning. Moreover, SPFresh employs in-place updates to maintain its index structure, limiting its efficiency in handling high-throughput insertions and deletions under dynamic workloads. This paper presents LSM-VEC, a disk-based dynamic vector index that integrates hierarchical graph indexing with LSM-tree storage. By distributing the proximity graph across multiple LSM-tree levels, LSM-VEC supports out-of-place vector updates. It improves search efficiency through a sampling-based probabilistic search strategy with adaptive neighbor selection, and it further reduces I/O through connectivity-aware graph reordering, all without requiring global reconstruction. Experiments on billion-scale datasets demonstrate that LSM-VEC consistently outperforms existing disk-based ANN systems. It achieves higher recall, lower query and update latency, and reduces memory footprint by over 66.2%, making it well-suited for real-world large-scale vector search with dynamic updates.
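To make the abstract's "sampling-based probabilistic search" idea concrete, the sketch below shows a greedy best-first graph traversal that, at each hop, evaluates only a random sample of a node's neighbors instead of all of them, cutting distance computations (a stand-in for disk reads) at a small cost in recall. All names here (`sampled_greedy_search`, `sample_ratio`, `ef`) are illustrative assumptions, not LSM-VEC's actual API, and the fixed sampling ratio is a simplification of the paper's adaptive neighbor selection.

```python
import heapq
import random

def sampled_greedy_search(query, entry, neighbors, dist, sample_ratio=0.5, ef=8):
    """Best-first ANN search over a proximity graph that probes only a
    random fraction of each node's neighbor list.

    query        -- the query vector (opaque to this function)
    entry        -- id of the entry node
    neighbors(n) -- returns the neighbor ids of node n (e.g., a disk read)
    dist(q, n)   -- distance from the query to node n
    sample_ratio -- fraction of neighbors evaluated per hop (1.0 = full search)
    ef           -- size of the result/candidate frontier
    Returns the ef best (distance, node) pairs found, nearest first.
    """
    d0 = dist(query, entry)
    visited = {entry}
    cands = [(d0, entry)]        # min-heap of candidates to expand
    best = [(-d0, entry)]        # max-heap (negated) of current best results
    while cands:
        d, node = heapq.heappop(cands)
        if len(best) >= ef and d > -best[0][0]:
            break                # nearest open candidate is worse than all results
        nbrs = neighbors(node)
        k = max(1, int(len(nbrs) * sample_ratio))
        for nb in random.sample(nbrs, k):   # probe only a sampled subset
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(query, nb)
            if len(best) < ef or nd < -best[0][0]:
                heapq.heappush(cands, (nd, nb))
                heapq.heappush(best, (-nd, nb))
                if len(best) > ef:
                    heapq.heappop(best)     # drop the current worst result
    return sorted((-nd, n) for nd, n in best)
```

With `sample_ratio=1.0` this degenerates to the familiar exhaustive greedy search used by in-memory graph indices; lowering the ratio trades recall for fewer neighbor fetches, which is the relevant currency when the graph lives on disk.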