AI Summary
This work addresses the high latency incurred by traditional approximate nearest neighbor search (ANNS) systems during the refinement stage, where full-precision vectors must be fetched from slow storage, a bottleneck exacerbated by the growing scale of large language model and multimodal embeddings. To overcome this challenge, the authors propose an efficient far-memory-oriented refinement mechanism that integrates hierarchical residual quantization with progressive distance estimation. Residual ternary codes are stored in far memory, while a custom accelerator built on a CXL Type-2 device enables low-latency local computation and supports early termination, avoiding reads of entire vectors. Compared to a state-of-the-art GPU-based ANNS system, the proposed approach achieves 2.4× higher storage efficiency and up to 9× greater throughput.
Abstract
Approximate Nearest-Neighbor Search (ANNS) is a key technique in retrieval-augmented generation (RAG), enabling rapid identification of the most relevant high-dimensional embeddings from massive vector databases. Modern ANNS engines accelerate this process using prebuilt indexes and store compressed vector-quantized representations in fast memory. However, they still rely on a costly second-pass refinement stage that reads full-precision vectors from slower storage such as SSDs. For modern text and multimodal embeddings, these reads now dominate end-to-end query latency. We propose FaTRQ, a far-memory-aware refinement system built on tiered memory that eliminates the need to fetch full vectors from storage. It introduces a progressive distance estimator that refines coarse scores using compact residuals streamed from far memory; refinement stops early once a candidate is provably outside the top-k. To support this, we propose tiered residual quantization, which encodes residuals as ternary values stored efficiently in far memory. A custom accelerator deployed in a CXL Type-2 device performs low-latency refinement locally. Together, FaTRQ improves storage efficiency by 2.4$\times$ and throughput by up to 9$\times$ over a state-of-the-art GPU-based ANNS system.
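The two core ideas above, ternary-coded residual tiers and progressive distance refinement with early exit, can be sketched in a few lines. This is an illustrative toy, not FaTRQ's actual encoder or accelerator logic: the per-tier scale choice (`mean(|r|)`), the `s/2` threshold, and the simple threshold-based early exit are all assumptions made for the sketch (a real system would pair the exit with a bound on the remaining quantization error to make pruning provably safe).

```python
import numpy as np

def ternary_encode(residual, tiers=3):
    """Encode a residual vector as `tiers` layers of ternary codes.

    Each tier stores codes in {-1, 0, +1} plus one float scale, so a
    tier costs roughly 2 bits per dimension instead of 32 bits for a
    full-precision component. Later tiers encode what earlier tiers
    left over (hierarchical residual quantization).
    """
    codes, scales = [], []
    r = residual.astype(np.float64).copy()
    for _ in range(tiers):
        s = np.mean(np.abs(r))                    # illustrative per-tier scale
        c = np.where(np.abs(r) > s / 2, np.sign(r), 0.0)
        codes.append(c.astype(np.int8))
        scales.append(s)
        r -= s * c                                # residual left for next tier
    return codes, scales

def progressive_distance(query, coarse_approx, codes, scales, prune_threshold):
    """Refine a coarse distance tier by tier, streaming one ternary
    tier at a time and exiting early once the running estimate already
    exceeds the current top-k threshold."""
    approx = coarse_approx.astype(np.float64).copy()
    dist = np.sum((query - approx) ** 2)
    for c, s in zip(codes, scales):
        approx += s * c                           # apply one streamed tier
        dist = np.sum((query - approx) ** 2)
        if dist > prune_threshold:                # cannot enter the top-k
            return dist, True                     # terminated early
    return dist, False
```

In this sketch the reconstruction error shrinks with each tier consumed, so candidates that are clearly out of the top-k can be discarded after reading only a fraction of their residual codes, which is the behavior the far-memory streaming design exploits.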