Breaking the Storage-Compute Bottleneck in Billion-Scale ANNS: A GPU-Driven Asynchronous I/O Framework

📅 2025-07-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low throughput of existing disk-based approximate nearest neighbor search (ANNS) systems for billion-scale vector retrieval—primarily caused by non-overlapped SSD I/O and distance computation, as well as high I/O-stack latency—this paper proposes the first GPU-centric asynchronous I/O ANNS framework. The framework (FlashANNS) introduces three key innovations: (1) a dependency-relaxed asynchronous pipeline enabling deep overlap between SSD I/O and GPU computation; (2) warp-level fine-grained SSD concurrency for improved I/O parallelism; and (3) a compute-I/O balanced, graph-degree adaptive selection strategy. The framework integrates a graph-based index, a lock-free I/O stack, and a lightweight sampling mechanism. Experiments demonstrate that the approach achieves 2.3–5.9× higher throughput than state-of-the-art methods on a single SSD, and a 2.7–12.2× improvement with multiple SSDs, while maintaining high recall.
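The dependency-relaxed pipeline idea—reading the next batch from SSD while the GPU computes distances for the current one—can be illustrated with a minimal double-buffered sketch. This is not the paper's implementation; the reader/compute stages, timings, and the `pipeline` function are illustrative stand-ins, with a bounded queue playing the role of the staging buffer.

```python
import threading
import queue
import time

def reader(batches, buf):
    """Producer stage: simulated SSD reads feeding a bounded staging buffer."""
    for batch in batches:
        time.sleep(0.01)          # stand-in for SSD read latency
        buf.put(batch)            # hand the batch to the compute stage
    buf.put(None)                 # sentinel: no more batches

def pipeline(batches):
    """Consumer stage: simulated GPU distance computation, overlapped with reads."""
    buf = queue.Queue(maxsize=2)  # small buffer: deep overlap, bounded memory
    t = threading.Thread(target=reader, args=(batches, buf))
    t.start()
    results = []
    while (batch := buf.get()) is not None:
        time.sleep(0.01)          # stand-in for GPU distance computation
        results.append(sum(batch))  # stand-in for a distance reduction
    t.join()
    return results

print(pipeline([[1, 2], [3, 4], [5, 6]]))  # I/O of batch i+1 overlaps compute of batch i
```

With the buffer in place, the read of batch i+1 proceeds while batch i is being processed, so total time approaches max(I/O, compute) rather than their sum.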

📝 Abstract
With the advancement of information retrieval, recommendation systems, and Retrieval-Augmented Generation (RAG), Approximate Nearest Neighbor Search (ANNS) has gained widespread adoption due to its favorable performance and accuracy. While several disk-based ANNS systems have emerged to handle exponentially growing vector datasets, they suffer from suboptimal performance due to two inherent limitations: 1) they fail to overlap SSD accesses with distance computation, and 2) they incur extended I/O latency caused by a suboptimal I/O stack. To address these challenges, we present FlashANNS, a GPU-accelerated out-of-core graph-based ANNS system built on I/O-compute overlapping. Our core insight lies in the synchronized orchestration of I/O and computation through three key innovations: 1) Dependency-relaxed asynchronous pipeline: FlashANNS decouples I/O-computation dependencies to fully overlap GPU distance calculations with SSD data transfers. 2) Warp-level concurrent SSD access: FlashANNS implements a lock-free I/O stack with warp-level concurrency control to reduce latency-induced overhead. 3) Computation-I/O balanced graph degree selection: FlashANNS selects graph degrees via lightweight compute-to-I/O ratio sampling, ensuring an optimal balance between computational load and storage access latency across different I/O bandwidth configurations. We implement FlashANNS and compare it with state-of-the-art out-of-core ANNS systems (SPANN, DiskANN) and a GPU-accelerated out-of-core ANNS system (FusionANNS). Experimental results demonstrate that at ≥95% recall@10 accuracy, our method achieves 2.3–5.9× higher throughput compared to existing SOTA methods with a single SSD, and further attains a 2.7–12.2× throughput improvement in multi-SSD configurations.
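The warp-level lock-free submission described in the abstract can be illustrated by ticket-based slot reservation: each submitter claims a unique queue slot with one atomic increment instead of taking a lock. This is only a conceptual sketch, not the paper's CUDA/NVMe code; threads stand in for warps, and `submit`, `ticket`, and `QUEUE_DEPTH` are hypothetical names. (In CPython, `next()` on an `itertools.count` is effectively atomic, which makes it a convenient stand-in for a fetch-and-add.)

```python
import itertools
import threading

QUEUE_DEPTH = 8
ticket = itertools.count()                  # stand-in for an atomic fetch-and-add counter
submission_queue = [None] * QUEUE_DEPTH

def submit(request):
    """Claim a unique submission-queue slot without any lock."""
    slot = next(ticket) % QUEUE_DEPTH       # atomic ticket -> distinct slot per submitter
    submission_queue[slot] = request        # fill the claimed slot
    return slot

# Eight concurrent "warps" each submit one read request.
threads = [threading.Thread(target=submit, args=(f"read-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(submission_queue))             # every request landed in its own slot
```

Because slot ownership is decided by the atomic ticket, submitters never contend on a shared lock, which is the property the paper's lock-free I/O stack exploits at warp granularity.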
Problem

Research questions and friction points this paper is trying to address.

SSD accesses are not overlapped with GPU distance computation in existing disk-based ANNS systems
High I/O-stack latency limits SSD access concurrency
Fixed graph degrees leave compute and I/O unbalanced across bandwidth configurations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dependency-Relaxed asynchronous pipeline for I/O-compute overlap
Warp-Level concurrent SSD access with lock-free control
Computation-I/O balanced graph degree selection
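The third innovation—picking a graph degree whose per-hop compute cost matches per-hop I/O cost—can be sketched as a simple cost-matching search. The function name `select_degree` and the parameters `compute_cost_per_neighbor` and `io_cost_per_node` are hypothetical stand-ins for the paper's lightweight compute-to-I/O ratio sampling.

```python
def select_degree(candidate_degrees, compute_cost_per_neighbor, io_cost_per_node):
    """Pick the degree whose per-hop compute time best matches per-hop I/O time,
    so neither the GPU nor the SSD sits idle in the overlapped pipeline."""
    def imbalance(d):
        compute = d * compute_cost_per_neighbor  # distance calcs per visited node
        io = io_cost_per_node                    # one node read per hop
        return abs(compute - io)
    return min(candidate_degrees, key=imbalance)

# With a relatively slow SSD (high per-node read cost), a larger degree
# keeps the GPU busy during each read; a fast SSD would favor a smaller one.
print(select_degree([16, 32, 64], compute_cost_per_neighbor=1.0, io_cost_per_node=30.0))  # → 32
```

The intuition: a higher degree packs more distance computations behind each node fetch, so the best degree shifts with the available I/O bandwidth, which is why the paper re-samples the ratio per configuration.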
Yang Xiao
Zhejiang University
Mo Sun
Zhejiang University
Ziyu Song
Zhejiang University
Bing Tian
Huazhong University of Science and Technology
Jie Zhang
Zhejiang University
Jie Sun
Zhejiang University
Zeke Wang
Zhejiang University
Machine Learning Systems · SmartNIC · FPGA · GPU