🤖 AI Summary
Disk-based vector databases suffer from poor cache locality and high tail latency because semantic similarity queries induce non-uniform access patterns. To address this, we propose a context-aware, low-latency retrieval mechanism. Our method dynamically groups queries by embedding-level semantic similarity and introduces group-aware prefetching alongside latency-aware cluster loading, explicitly modeling intra-group locality and inter-group switching overhead. Crucially, the approach requires no modification to the underlying storage layout; it optimizes cache behavior purely through coordinated query scheduling and prefetching. Evaluations on mainstream benchmarks demonstrate a 33% reduction in 99th-percentile latency, a significant improvement in cache hit rate, and a substantial decrease in end-to-end retrieval response time.
📝 Abstract
Embedding models capture both semantic and syntactic structure in queries, often mapping different queries to similar regions of the vector space. This results in non-uniform cluster access patterns in modern disk-based vector databases. Existing approaches optimize individual queries in isolation and overlook these cluster access patterns, failing to exploit the locality of queries that touch similar clusters; this oversight inflates the cache miss penalty. To minimize the cache miss penalty, we propose CALL, a context-aware query grouping mechanism that organizes queries based on shared cluster access patterns. CALL further incorporates group-aware prefetching, which minimizes cache misses during transitions between query groups, together with latency-aware cluster loading. Experimental results show that CALL reduces 99th-percentile tail latency by up to 33% while consistently maintaining a higher cache hit ratio, substantially reducing search latency.
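To make the core idea concrete, here is a minimal sketch of grouping queries by shared cluster access patterns and deriving a prefetch plan for group transitions. The function names, the Jaccard-overlap grouping criterion, and the threshold are illustrative assumptions for exposition; they are not CALL's actual algorithm, which the paper defines in full.

```python
def group_queries(query_cluster_ids, overlap_threshold=0.5):
    """Greedily group queries whose probed cluster sets overlap.

    query_cluster_ids: list where entry i holds the cluster IDs that
    query i must read (e.g., the nearest IVF clusters of its embedding).
    Returns a list of (cluster_set, member_query_indices) pairs.
    Note: the Jaccard criterion and greedy assignment are assumptions
    made for this sketch, not the paper's grouping policy.
    """
    groups = []
    for qid, clusters in enumerate(query_cluster_ids):
        cset = set(clusters)
        placed = False
        for gset, members in groups:
            # Jaccard-style overlap between the query and the group
            if len(cset & gset) / len(cset | gset) >= overlap_threshold:
                gset |= cset          # extend the group's cluster footprint
                members.append(qid)
                placed = True
                break
        if not placed:
            groups.append((cset, [qid]))
    return groups

def prefetch_plan(groups):
    """For each group-to-group transition, list the clusters the next
    group needs that the current group has not already pulled into cache,
    so they can be prefetched while the current group executes."""
    return [sorted(nxt - cur)
            for (cur, _), (nxt, _) in zip(groups, groups[1:])]

# Toy usage: two overlapping queries form one group; the third is disjoint.
groups = group_queries([[1, 2, 3], [2, 3, 4], [10, 11, 12]])
# Clusters 10-12 would be prefetched during the transition to group 2.
plan = prefetch_plan(groups)
```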