🤖 AI Summary
To address the challenges of high-throughput, low-latency approximate nearest neighbor search (ANNS) over billion-scale vector databases in retrieval-augmented generation (RAG) scenarios, this paper proposes Cosmos, a fully in-memory ANNS system built on CXL memory. Cosmos integrates general-purpose compute cores directly into CXL memory devices to offload the entire ANNS pipeline to hardware, and pairs a rank-level parallel distance-computation architecture with a graph-structure-aware data placement strategy to jointly optimize computational efficiency and memory-access locality. The system supports in-memory vector index construction and real-time search acceleration. Evaluated on SIFT1B and DEEP1B, it achieves up to 6.72× and 2.35× higher throughput than a baseline CXL system and a state-of-the-art CXL-based ANNS system, respectively, significantly improving the scalability and end-to-end performance of RAG pipelines.
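The rank-level parallel distance computation mentioned above can be pictured as striping each vector's dimensions across DRAM ranks so that partial distances are computed concurrently and then reduced. The sketch below simulates this with numpy; it is an illustration of the general idea, not the paper's implementation, and names such as `NUM_RANKS` and `rank_partial_distance` are assumptions.

```python
import numpy as np

# Hypothetical number of DRAM ranks inside one CXL memory device.
NUM_RANKS = 4

def rank_partial_distance(query_slice, db_slice):
    # Work one rank would perform on its local dimension stripe:
    # a partial squared-L2 distance over that stripe.
    diff = db_slice - query_slice
    return np.sum(diff * diff, axis=1)

def parallel_l2_distance(query, database):
    # Stripe the dimensions across ranks, compute partials "in parallel",
    # then reduce. On real hardware each stripe lives in a distinct rank,
    # so the partial sums can proceed concurrently.
    dim = query.shape[0]
    stripes = np.array_split(np.arange(dim), NUM_RANKS)
    partials = [rank_partial_distance(query[s], database[:, s]) for s in stripes]
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)

dist = parallel_l2_distance(q, db)
# The striped computation matches a monolithic one up to float rounding.
assert np.allclose(dist, np.sum((db - q) ** 2, axis=1), rtol=1e-4)
```

Because squared-L2 distance decomposes as a sum over dimensions, the reduction is exact up to floating-point rounding, which is what makes rank-level striping attractive for bandwidth.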
📝 Abstract
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting relevant context retrieved from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware and RDMA clusters lack flexibility or incur network overhead. We present Cosmos, which integrates general-purpose cores within CXL memory devices for full ANNS offload and introduces rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement scheme that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72x higher throughput than the baseline CXL system and 2.35x higher than a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
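Adjacency-aware placement of the kind described above can be sketched as a greedy assignment: clusters that are strongly adjacent tend to be searched together, so spreading them across devices fans correlated lookups out and balances load. The following toy scheme is an assumption-laden illustration, not the paper's algorithm; `place_clusters`, the 10% capacity slack, and the scoring rule are all hypothetical.

```python
def place_clusters(cluster_sizes, adjacency, num_devices):
    """Greedy adjacency-aware placement sketch.

    adjacency[i][j] = inter-cluster proximity (higher = more often
    searched together). Strongly adjacent clusters are pushed onto
    different devices so their correlated lookups balance out.
    """
    n = len(cluster_sizes)
    capacity = sum(cluster_sizes) / num_devices * 1.1  # 10% slack per device
    load = [0.0] * num_devices
    assignment = {}
    # Place large clusters first: the remainder is easier to balance.
    for c in sorted(range(n), key=lambda c: -cluster_sizes[c]):
        best, best_score = None, None
        for d in range(num_devices):
            if load[d] + cluster_sizes[c] > capacity:
                continue
            # Penalize co-locating c with clusters it is strongly
            # adjacent to, then break ties toward the lighter device.
            affinity = sum(adjacency[c][o] for o, dev in assignment.items()
                           if dev == d)
            score = (affinity, load[d])
            if best_score is None or score < best_score:
                best, best_score = d, score
        if best is None:  # nothing fits under the cap: least-loaded fallback
            best = min(range(num_devices), key=lambda d: load[d])
        load[best] += cluster_sizes[c]
        assignment[c] = best
    return assignment

# Four equal clusters, two devices; (0,1) and (2,3) are strongly adjacent.
adj = [[0, 9, 1, 1],
       [9, 0, 1, 1],
       [1, 1, 0, 9],
       [1, 1, 9, 0]]
placement = place_clusters([4, 4, 4, 4], adj, num_devices=2)
assert placement[0] != placement[1]  # each adjacent pair is split
assert placement[2] != placement[3]  # across the two devices
```

A real system would weight proximity by observed query traffic rather than raw graph adjacency, but the load-versus-affinity trade-off is the same.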