🤖 AI Summary
To address the challenges of high-throughput, low-latency approximate nearest neighbor search (ANNS) over billion-scale vector databases in retrieval-augmented generation (RAG) scenarios, this paper proposes Cosmos, a fully in-memory ANNS system built on CXL memory. Cosmos integrates general-purpose compute cores directly into CXL memory devices to offload the entire ANNS pipeline to hardware, and pairs a rank-level parallel distance-computation architecture with a graph-structure-aware data placement strategy to jointly optimize computational efficiency and memory-access locality. The system supports in-memory vector index construction and real-time search acceleration. Evaluated on SIFT1B and DEEP1B, it achieves up to 6.72× and 2.35× higher throughput than a baseline CXL system and a state-of-the-art CXL-based ANNS system, respectively, significantly improving the scalability and end-to-end performance of RAG pipelines.
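The rank-level parallel distance computation mentioned above can be pictured as striping each vector's dimensions across DRAM ranks so that partial distances are computed concurrently and then reduced. The sketch below simulates this with numpy; it is an illustration of the general idea, not the paper's implementation, and names such as `NUM_RANKS` and `rank_partial_distance` are assumptions.

```python
import numpy as np

# Hypothetical number of DRAM ranks inside one CXL memory device.
NUM_RANKS = 4

def rank_partial_distance(query_slice, db_slice):
    # Work one rank would perform on its local dimension stripe:
    # a partial squared-L2 distance over that stripe.
    diff = db_slice - query_slice
    return np.sum(diff * diff, axis=1)

def parallel_l2_distance(query, database):
    # Stripe the dimensions across ranks, compute partials "in parallel",
    # then reduce. On real hardware each stripe lives in a distinct rank,
    # so the partial sums can proceed concurrently.
    dim = query.shape[0]
    stripes = np.array_split(np.arange(dim), NUM_RANKS)
    partials = [rank_partial_distance(query[s], database[:, s]) for s in stripes]
    return np.sum(partials, axis=0)

rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 128)).astype(np.float32)
q = rng.standard_normal(128).astype(np.float32)

dist = parallel_l2_distance(q, db)
# The striped computation matches a monolithic one up to float rounding.
assert np.allclose(dist, np.sum((db - q) ** 2, axis=1), rtol=1e-4)
```

Because squared-L2 distance decomposes as a sum over dimensions, the reduction is exact up to floating-point rounding, which is what makes rank-level striping attractive for bandwidth.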
📝 Abstract
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting relevant context retrieved from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware and RDMA clusters lack flexibility or incur network overhead. We present Cosmos, which integrates general-purpose cores within CXL memory devices for full ANNS offload and introduces rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement scheme that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72x higher throughput than the baseline CXL system and 2.35x higher than a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
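Adjacency-aware placement of the kind described above can be sketched as a greedy assignment: clusters that are strongly adjacent tend to be searched together, so spreading them across devices fans correlated lookups out and balances load. The following toy scheme is an assumption-laden illustration, not the paper's algorithm; `place_clusters`, the 10% capacity slack, and the scoring rule are all hypothetical.

```python
def place_clusters(cluster_sizes, adjacency, num_devices):
    """Greedy adjacency-aware placement sketch.

    adjacency[i][j] = inter-cluster proximity (higher = more often
    searched together). Strongly adjacent clusters are pushed onto
    different devices so their correlated lookups balance out.
    """
    n = len(cluster_sizes)
    capacity = sum(cluster_sizes) / num_devices * 1.1  # 10% slack per device
    load = [0.0] * num_devices
    assignment = {}
    # Place large clusters first: the remainder is easier to balance.
    for c in sorted(range(n), key=lambda c: -cluster_sizes[c]):
        best, best_score = None, None
        for d in range(num_devices):
            if load[d] + cluster_sizes[c] > capacity:
                continue
            # Penalize co-locating c with clusters it is strongly
            # adjacent to, then break ties toward the lighter device.
            affinity = sum(adjacency[c][o] for o, dev in assignment.items()
                           if dev == d)
            score = (affinity, load[d])
            if best_score is None or score < best_score:
                best, best_score = d, score
        if best is None:  # nothing fits under the cap: least-loaded fallback
            best = min(range(num_devices), key=lambda d: load[d])
        load[best] += cluster_sizes[c]
        assignment[c] = best
    return assignment

# Four equal clusters, two devices; (0,1) and (2,3) are strongly adjacent.
adj = [[0, 9, 1, 1],
       [9, 0, 1, 1],
       [1, 1, 0, 9],
       [1, 1, 9, 0]]
placement = place_clusters([4, 4, 4, 4], adj, num_devices=2)
assert placement[0] != placement[1]  # each adjacent pair is split
assert placement[2] != placement[3]  # across the two devices
```

A real system would weight proximity by observed query traffic rather than raw graph adjacency, but the load-versus-affinity trade-off is the same.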