🤖 AI Summary
Disk-based vector databases suffer from poor cache locality and high tail latency because semantic similarity queries induce non-uniform access patterns. To address this, we propose a context-aware, low-latency retrieval mechanism. Our method dynamically groups queries by embedding-level semantic similarity and introduces group-aware prefetching alongside latency-aware cluster loading, explicitly modeling intra-group locality and inter-group switching overhead. Crucially, the approach requires no modification to the underlying storage layout; it optimizes cache behavior purely through coordinated query scheduling and prefetching. Evaluations on mainstream benchmarks demonstrate a 33% reduction in 99th-percentile latency, a significant improvement in cache hit rate, and a substantial decrease in end-to-end retrieval response time.
📝 Abstract
Embedding models capture both semantic and syntactic structure in queries, often mapping different queries to similar regions of the vector space. This results in non-uniform cluster access patterns in modern disk-based vector databases. Existing approaches optimize individual queries in isolation and overlook these cluster access patterns, failing to exploit the locality of queries that touch similar clusters; this oversight inflates the cache miss penalty. To minimize the cache miss penalty, we propose CALL, a context-aware query grouping mechanism that organizes queries based on shared cluster access patterns. CALL further incorporates group-aware prefetching, which minimizes cache misses during transitions between query groups, together with latency-aware cluster loading. Experimental results show that CALL reduces 99th-percentile tail latency by up to 33% while consistently maintaining a higher cache hit ratio, substantially reducing search latency.
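To make the core idea concrete, here is a minimal sketch of grouping queries by shared cluster access patterns and deriving a prefetch plan for group transitions. The function names, the Jaccard-overlap grouping criterion, and the threshold are illustrative assumptions for exposition; they are not CALL's actual algorithm, which the paper defines in full.

```python
def group_queries(query_cluster_ids, overlap_threshold=0.5):
    """Greedily group queries whose probed cluster sets overlap.

    query_cluster_ids: list where entry i holds the cluster IDs that
    query i must read (e.g., the nearest IVF clusters of its embedding).
    Returns a list of (cluster_set, member_query_indices) pairs.
    Note: the Jaccard criterion and greedy assignment are assumptions
    made for this sketch, not the paper's grouping policy.
    """
    groups = []
    for qid, clusters in enumerate(query_cluster_ids):
        cset = set(clusters)
        placed = False
        for gset, members in groups:
            # Jaccard-style overlap between the query and the group
            if len(cset & gset) / len(cset | gset) >= overlap_threshold:
                gset |= cset          # extend the group's cluster footprint
                members.append(qid)
                placed = True
                break
        if not placed:
            groups.append((cset, [qid]))
    return groups

def prefetch_plan(groups):
    """For each group-to-group transition, list the clusters the next
    group needs that the current group has not already pulled into cache,
    so they can be prefetched while the current group executes."""
    return [sorted(nxt - cur)
            for (cur, _), (nxt, _) in zip(groups, groups[1:])]

# Toy usage: two overlapping queries form one group; the third is disjoint.
groups = group_queries([[1, 2, 3], [2, 3, 4], [10, 11, 12]])
# Clusters 10-12 would be prefetched during the transition to group 2.
plan = prefetch_plan(groups)
```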