🤖 AI Summary
Existing methods for billion-scale dynamic vector retrieval face three key challenges: memory constraints (e.g., HNSW), inefficient updates (e.g., DiskANN’s reliance on offline graph construction), and a fundamental trade-off between recall and throughput (e.g., SPFresh’s coarse-grained partitioning). This paper introduces the first disk-native storage system that tightly integrates hierarchical proximity graphs with an LSM-tree architecture, enabling out-of-place incremental updates without global reconstruction. We propose a novel sampling-based probabilistic search strategy and a connectivity-driven, block-level graph reordering mechanism, jointly optimizing I/O efficiency and recall. Evaluated on billion-scale datasets, our system achieves significantly higher recall than DiskANN and SPFresh, while reducing query and update latency and cutting memory footprint by over 66.2%.
📝 Abstract
Vector search underpins modern AI applications by supporting approximate nearest neighbor (ANN) queries over high-dimensional embeddings in tasks like retrieval-augmented generation (RAG), recommendation systems, and multimodal search. Traditional ANN indices (e.g., HNSW) are limited by memory constraints at large data scales. Disk-based indices such as DiskANN reduce memory overhead but rely on offline graph construction, resulting in costly and inefficient vector updates. The state-of-the-art clustering-based approach SPFresh offers better scalability but suffers from reduced recall due to coarse partitioning. Moreover, SPFresh employs in-place updates to maintain its index structure, limiting its efficiency in handling high-throughput insertions and deletions under dynamic workloads. This paper presents LSM-VEC, a disk-based dynamic vector index that integrates hierarchical graph indexing with LSM-tree storage. By distributing the proximity graph across multiple LSM-tree levels, LSM-VEC supports out-of-place vector updates. It improves search efficiency through a sampling-based probabilistic search strategy with adaptive neighbor selection, and it further reduces I/O through connectivity-aware graph reordering, all without requiring global reconstruction. Experiments on billion-scale datasets demonstrate that LSM-VEC consistently outperforms existing disk-based ANN systems. It achieves higher recall, lower query and update latency, and reduces memory footprint by over 66.2%, making it well-suited for real-world large-scale vector search with dynamic updates.
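To make the abstract's "sampling-based probabilistic search" idea concrete, the sketch below shows a greedy best-first graph traversal that, at each hop, evaluates only a random sample of a node's neighbors instead of all of them, cutting distance computations (a stand-in for disk reads) at a small cost in recall. All names here (`sampled_greedy_search`, `sample_ratio`, `ef`) are illustrative assumptions, not LSM-VEC's actual API, and the fixed sampling ratio is a simplification of the paper's adaptive neighbor selection.

```python
import heapq
import random

def sampled_greedy_search(query, entry, neighbors, dist, sample_ratio=0.5, ef=8):
    """Best-first ANN search over a proximity graph that probes only a
    random fraction of each node's neighbor list.

    query        -- the query vector (opaque to this function)
    entry        -- id of the entry node
    neighbors(n) -- returns the neighbor ids of node n (e.g., a disk read)
    dist(q, n)   -- distance from the query to node n
    sample_ratio -- fraction of neighbors evaluated per hop (1.0 = full search)
    ef           -- size of the result/candidate frontier
    Returns the ef best (distance, node) pairs found, nearest first.
    """
    d0 = dist(query, entry)
    visited = {entry}
    cands = [(d0, entry)]        # min-heap of candidates to expand
    best = [(-d0, entry)]        # max-heap (negated) of current best results
    while cands:
        d, node = heapq.heappop(cands)
        if len(best) >= ef and d > -best[0][0]:
            break                # nearest open candidate is worse than all results
        nbrs = neighbors(node)
        k = max(1, int(len(nbrs) * sample_ratio))
        for nb in random.sample(nbrs, k):   # probe only a sampled subset
            if nb in visited:
                continue
            visited.add(nb)
            nd = dist(query, nb)
            if len(best) < ef or nd < -best[0][0]:
                heapq.heappush(cands, (nd, nb))
                heapq.heappush(best, (-nd, nb))
                if len(best) > ef:
                    heapq.heappop(best)     # drop the current worst result
    return sorted((-nd, n) for nd, n in best)
```

With `sample_ratio=1.0` this degenerates to the familiar exhaustive greedy search used by in-memory graph indices; lowering the ratio trades recall for fewer neighbor fetches, which is the relevant currency when the graph lives on disk.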