COSMOS: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search

📅 2025-05-22
🏛️ IEEE Computer Architecture Letters
🤖 AI Summary
To meet the demand for high-throughput, low-latency approximate nearest neighbor search (ANNS) over billion-scale vector databases in retrieval-augmented generation (RAG), this paper proposes Cosmos, the first fully in-memory ANNS system built on CXL memory. Cosmos integrates general-purpose compute cores directly into CXL memory devices so that the entire ANNS pipeline is offloaded to the devices. It adds a rank-level parallel distance computation scheme to make fuller use of memory bandwidth, and an adjacency-aware data placement strategy that balances search load across CXL devices based on inter-cluster proximity. The system supports in-memory vector index construction and real-time search acceleration. Evaluated on SIFT1B and DEEP1B, it achieves up to 6.72× higher throughput than a baseline CXL system and up to 2.35× higher throughput than a state-of-the-art CXL-based ANNS system, which improves the scalability and end-to-end performance of RAG pipelines.

📝 Abstract
Retrieval-Augmented Generation (RAG) is crucial for improving the quality of large language models by injecting proper contexts extracted from external sources. RAG requires high-throughput, low-latency Approximate Nearest Neighbor Search (ANNS) over billion-scale vector databases. Conventional DRAM/SSD solutions face capacity/latency limits, whereas specialized hardware or RDMA clusters lack flexibility or incur network overhead. We present Cosmos, integrating general-purpose cores within CXL memory devices for full ANNS offload and introducing rank-level parallel distance computation to maximize memory bandwidth. We also propose an adjacency-aware data placement that balances search loads across CXL devices based on inter-cluster proximity. Evaluations on SIFT1B and DEEP1B traces show that Cosmos achieves up to 6.72x higher throughput than the baseline CXL system and 2.35x over a state-of-the-art CXL-based solution, demonstrating scalability for RAG pipelines.
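The rank-level parallel distance computation mentioned in the abstract can be illustrated with a small software analogy. The sketch below is not the Cosmos hardware design. It assumes, purely for illustration, that each vector is sliced across memory ranks, that each rank computes a partial squared L2 distance over its own slice, and that the partial results are summed afterward. The function names, the slicing layout, and the use of threads to stand in for ranks are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_by_rank(vec, num_ranks):
    """Slice a vector into num_ranks contiguous chunks (assumed layout)."""
    chunk = -(-len(vec) // num_ranks)  # ceiling division
    return [vec[i * chunk:(i + 1) * chunk] for i in range(num_ranks)]


def partial_sq_l2(q_chunk, v_chunk):
    """Partial squared L2 distance over the slice held by one rank."""
    return sum((a - b) ** 2 for a, b in zip(q_chunk, v_chunk))


def rank_parallel_sq_l2(query, vector, num_ranks=4):
    """Compute one partial distance per rank in parallel, then reduce."""
    q_parts = split_by_rank(query, num_ranks)
    v_parts = split_by_rank(vector, num_ranks)
    # Threads stand in for per-rank compute; real ranks would work
    # on data that physically resides in that rank.
    with ThreadPoolExecutor(max_workers=num_ranks) as pool:
        partials = list(pool.map(partial_sq_l2, q_parts, v_parts))
    return sum(partials)


if __name__ == "__main__":
    q = [1, 2, 3, 4, 5, 6, 7, 8]
    v = [0] * 8
    print(rank_parallel_sq_l2(q, v, num_ranks=4))
```

Because squared L2 distance is a sum over dimensions, the per-slice partial sums can be computed independently and combined in any order, which is what makes this kind of decomposition possible.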
Problem

Research questions and friction points this paper is trying to address.

Enabling high-throughput ANNS for billion-scale vector databases
Overcoming DRAM/SSD capacity-latency limits in RAG systems
Optimizing CXL-based in-memory ANNS with parallel computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

CXL memory devices for full ANNS offload
Rank-level parallel distance computation
Adjacency-aware data placement for load balance
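To make the adjacency-aware placement idea more concrete, here is a minimal greedy sketch. It is an assumption about how such a policy could look, not the algorithm from the paper. It assumes that a query tends to probe several nearby clusters, so placing neighboring clusters on different devices spreads the work of a single query. The function `place_clusters`, the `num_neighbors` parameter, and the tie-breaking by cluster count are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two centroids."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def place_clusters(centroids, num_devices, num_neighbors=2):
    """Greedily assign clusters to devices, avoiding co-locating neighbors.

    Returns a dict mapping cluster index to device index.
    """
    n = len(centroids)
    # For each cluster, record its nearest clusters by centroid distance.
    neighbors = []
    for i in range(n):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sq_dist(centroids[i], centroids[j]),
        )
        neighbors.append(set(others[:num_neighbors]))

    assignment = {}
    load = [0] * num_devices  # number of clusters placed on each device

    for i in range(n):
        def cost(d):
            # Prefer devices holding few of this cluster's neighbors,
            # then devices with fewer clusters, then the lower index.
            conflicts = sum(1 for j in neighbors[i] if assignment.get(j) == d)
            return (conflicts, load[d], d)

        best = min(range(num_devices), key=cost)
        assignment[i] = best
        load[best] += 1
    return assignment


if __name__ == "__main__":
    centroids = [(0.0,), (1.0,), (2.0,), (3.0,)]
    print(place_clusters(centroids, num_devices=2, num_neighbors=1))
```

In this toy example the four clusters lie on a line, and the greedy rule alternates them between the two devices, so adjacent clusters never share a device and each device holds two clusters. A real policy would likely weigh cluster sizes and observed query load rather than simple cluster counts.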
Seoyoung Ko
Seoul National University, Seoul 08826, South Korea
Hyunjeong Shim
Seoul National University, Seoul 08826, South Korea
Wanju Doh
Seoul National University, Seoul 08826, South Korea
Sungmin Yun
Seoul National University
Computer Architecture, Computer Systems, Deep Learning
J. So
Yongsuk Kwon
Samsung Electronics Corporation, Hwaseong-si, Gyeonggi-do 18448, South Korea
Sang-Soo Park
Samsung Electronics Corporation, Hwaseong-si, Gyeonggi-do 18448, South Korea
Si-Dong Roh
Samsung Electronics Corporation, Hwaseong-si, Gyeonggi-do 18448, South Korea
Minyong Yoon
Samsung Electronics Corporation, Hwaseong-si, Gyeonggi-do 18448, South Korea
Taeksang Song
Samsung Electronics Corporation, Hwaseong-si, Gyeonggi-do 18448, South Korea
Jung-Ho Ahn
Seoul National University, Seoul 08826, South Korea