AlayaLaser: Efficient Index Layout and Search Strategy for Large-scale High-dimensional Vector Similarity Search

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing on-disk graph-based index systems for large-scale, high-dimensional vector similarity search: they target I/O reduction and largely overlook computational overhead, even though the paper observes that these systems become compute-bound rather than I/O-bound as dimensionality rises to hundreds or thousands. Using an adapted roofline model to analyze this bottleneck, the authors propose AlayaLaser, whose on-disk data layout exploits SIMD instructions on modern CPUs to reduce the compute cost. The system also integrates a degree-based node cache, cluster-based entry point selection, and an early dispatch strategy. Experiments on a range of large-scale, high-dimensional datasets show that AlayaLaser outperforms existing on-disk graph-based index systems and matches or even exceeds the performance of in-memory index systems.
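As a rough illustration of the degree-based node cache mentioned above, the sketch below picks which graph nodes to keep in memory by ranking them on in-degree. This is an assumption-laden toy, not AlayaLaser's implementation: the summary does not say whether in-degree or out-degree is used, and the function name `select_cache_nodes`, the adjacency-dict input, and the tie-breaking rule are all hypothetical.

```python
from collections import Counter


def select_cache_nodes(adjacency, capacity):
    """Return up to `capacity` node ids with the highest in-degree.

    Assumption: in-degree is used as a proxy for how often a node is
    visited during graph traversal. `adjacency` maps a node id to the
    list of its out-neighbors.
    """
    in_degree = Counter()
    for neighbors in adjacency.values():
        in_degree.update(neighbors)
    # Sort by descending in-degree; break ties by node id for determinism.
    ranked = sorted(in_degree.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:capacity]]
```

For example, with `adjacency = {0: [1, 2], 1: [2], 2: [0], 3: [2, 0]}` and `capacity = 2`, node 2 (three incoming edges) and node 0 (two incoming edges) would be selected for the cache.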

📝 Abstract
On-disk graph-based approximate nearest neighbor search (ANNS) is essential for large-scale, high-dimensional vector retrieval, yet its performance is widely regarded as limited by prohibitive I/O costs. Interestingly, we observe that on-disk graph-based index systems become compute-bound, not I/O-bound, as vector dimensionality rises (e.g., to hundreds or thousands of dimensions). This insight uncovers a significant optimization opportunity: existing on-disk graph-based index systems universally target I/O reduction and largely overlook computational overhead, which leaves substantial room for performance improvement. In this work, we propose AlayaLaser, an efficient on-disk graph-based index system for large-scale high-dimensional vector similarity search. In particular, we first analyze the performance of existing on-disk graph-based index systems with an adapted roofline model. Guided by that analysis, we devise a novel on-disk data layout in AlayaLaser that alleviates the compute bottleneck by exploiting SIMD instructions on modern CPUs. We then design a suite of optimization techniques (e.g., degree-based node cache, cluster-based entry point selection, and early dispatch strategy) to further improve the performance of AlayaLaser. Finally, we conduct extensive experiments on a wide range of large-scale high-dimensional vector datasets to verify the superiority of AlayaLaser. Specifically, AlayaLaser not only surpasses existing on-disk graph-based index systems but also matches or even exceeds the performance of in-memory index systems.
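To make the compute-bound versus I/O-bound distinction concrete, here is a minimal sketch of the classic roofline model. It is not the adapted variant the abstract refers to, whose details are not given here; the function names and any numbers plugged into them are illustrative assumptions.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, intensity):
    """Classic roofline: throughput is capped by compute or by data movement.

    `intensity` is operational intensity in FLOPs per byte moved.
    """
    return min(peak_gflops, bandwidth_gb_s * intensity)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Operational intensity (FLOPs per byte) at which the two roofs meet."""
    return peak_gflops / bandwidth_gb_s


def is_compute_bound(peak_gflops, bandwidth_gb_s, intensity):
    """A workload at or beyond the ridge point is limited by compute."""
    return intensity >= ridge_point(peak_gflops, bandwidth_gb_s)
```

With a made-up machine of 100 GFLOP/s peak compute and 10 GB/s bandwidth, the ridge point sits at 10 FLOPs per byte: a workload at 2 FLOPs per byte is bandwidth-limited to 20 GFLOP/s, while one at 50 FLOPs per byte hits the 100 GFLOP/s compute roof. In this picture, higher vector dimensionality means more distance arithmetic per byte fetched, which is consistent with the abstract's observation that the bottleneck shifts toward compute.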
Problem

Research questions and friction points this paper is trying to address.

on-disk ANNS
high-dimensional vectors
compute-bound
vector similarity search
graph-based index
Innovation

Methods, ideas, or system contributions that make the work stand out.

compute-bound optimization
SIMD acceleration
on-disk graph index
high-dimensional vector search
roofline model analysis

Authors

Weijian Chen (SUSTech)
Haotian Liu (AlayaDB AI)
Yangshen Deng (University of Edinburgh)
Long Xiang (AlayaDB AI)
Liang Huang (SUSTech)
Gezi Li (Huawei)
Bo Tang (Southern University of Science and Technology)
Data Management, Database