🤖 AI Summary
This work addresses a key limitation of existing on-disk graph-based index systems for large-scale, high-dimensional vector similarity search: they focus on reducing I/O and largely overlook computational overhead, even though the bottleneck in high-dimensional settings is computation rather than I/O. The study characterizes this compute bottleneck with an adapted roofline model and proposes AlayaLaser, whose compute-oriented on-disk data layout exploits SIMD instructions on modern CPUs. The system further integrates a degree-based node cache, cluster-based entry point selection, and an early dispatch strategy. Extensive experiments on multiple large-scale, high-dimensional datasets show that AlayaLaser outperforms existing on-disk graph-based index systems and matches or even exceeds the performance of in-memory index systems, which challenges the traditional I/O-centric design paradigm.
📝 Abstract
On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely believed to be limited by prohibitive I/O costs. Interestingly, we observe that on-disk graph-based index systems become compute-bound rather than I/O-bound as vector dimensionality rises (e.g., to hundreds or thousands of dimensions). This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, leaving substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. We first analyze the performance of existing on-disk graph-based index systems with an adapted roofline model. Guided by this analysis, we devise a novel on-disk data layout in AlayaLaser that alleviates the compute bottleneck by exploiting SIMD instructions on modern CPUs. We then design a suite of optimization techniques (e.g., a degree-based node cache, cluster-based entry point selection, and an early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experiments on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems.
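To make the compute-bound versus I/O-bound distinction concrete, the following is a minimal sketch of the classic roofline model, not the adapted variant used in the paper, whose details the abstract does not give. It assumes that attainable throughput is the lower of a compute roof and a bandwidth roof scaled by operational intensity (operations per byte read). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
def attainable_ops_per_sec(peak_ops_per_sec, bytes_per_sec, ops_per_byte):
    # Classic roofline: performance is capped by the lower of the compute
    # roof and the bandwidth roof (bandwidth times operational intensity).
    return min(peak_ops_per_sec, bytes_per_sec * ops_per_byte)


def bound_kind(peak_ops_per_sec, bytes_per_sec, ops_per_byte):
    # The ridge point is the operational intensity at which both roofs meet.
    ridge = peak_ops_per_sec / bytes_per_sec
    return "compute-bound" if ops_per_byte >= ridge else "I/O-bound"


if __name__ == "__main__":
    # Hypothetical hardware figures, for illustration only (not from the paper).
    peak = 1e11      # peak compute throughput, operations per second
    disk_bw = 2e9    # disk read bandwidth, bytes per second
    for intensity in (10.0, 100.0):  # operations per byte read
        kind = bound_kind(peak, disk_bw, intensity)
        perf = attainable_ops_per_sec(peak, disk_bw, intensity)
        print(f"intensity={intensity}: {kind}, attainable={perf:.3g} ops/s")
```

Under these assumed numbers, a workload moves from the I/O-bound region to the compute-bound region once its operational intensity crosses the ridge point; beyond that point, reducing I/O alone no longer raises attainable throughput.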