🤖 AI Summary
In Retrieval-Augmented Generation (RAG), the retrieval stage incurs substantial data movement overhead due to large-scale vector similarity search. Existing in-storage processing (ISP) acceleration approaches suffer from algorithmic mismatch, suboptimal data loading, and reliance on hardware modifications. This paper proposes the first RAG-optimized in-storage retrieval system. We design an ISP-native approximate nearest neighbor search (ANNS) algorithm, construct a tightly coupled joint index of embeddings and documents with a cross-plane parallel database layout, and introduce a lightweight flash translation layer (FTL) adaptation mechanism that requires no hardware modification. Experiments demonstrate that our system achieves 13× higher retrieval throughput and 55× better energy efficiency compared to a high-end CPU server, outperforming prior ISP-ANNS solutions. It also exhibits strong compatibility and deployment simplicity.
📝 Abstract
Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. This limitation, combined with the significant cost of retraining, renders them incapable of providing up-to-date responses. To overcome these issues, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: (i) indexing, which creates a database that facilitates similarity search on text embeddings, (ii) retrieval, which, given a user query, searches and retrieves relevant data from the database, and (iii) generation, which uses the user query and the retrieved data to generate a response. The retrieval stage of RAG in particular becomes a significant performance bottleneck in inference pipelines. In this stage, (i) a given user query is mapped to an embedding vector and (ii) an Approximate Nearest Neighbor Search (ANNS) algorithm searches for the most semantically similar embedding vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS workloads by performing computations inside the storage system. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications to the storage system, limiting performance and hindering their adoption. We propose REIS, the first Retrieval system tailored for RAG with In-Storage processing that addresses the limitations of existing implementations with three key mechanisms.
First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored algorithm and data placement technique that: (i) distributes embeddings across all planes of the storage system to exploit parallelism, and (ii) employs a lightweight Flash Translation Layer (FTL) to improve performance. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system, without requiring hardware modifications. Together, the three mechanisms form a cohesive framework that substantially improves both the performance and energy efficiency of RAG pipelines. Compared to a high-end server-grade system, REIS improves the performance (energy efficiency) of the retrieval stage by an average of 13× (55×). REIS also outperforms existing ISP-based ANNS accelerators without introducing any hardware modifications, enabling easier adoption in RAG pipelines.
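To make the retrieval stage and the plane-parallel placement idea more concrete, the following is a minimal host-side sketch in Python. It is not the REIS algorithm. It assumes an exact dot-product search as a stand-in for ANNS, a round-robin placement of (embedding, document) pairs across a hypothetical number of "planes", and a sequential loop where real hardware would search planes in parallel. All function names (`build_planes`, `search_plane`, `retrieve`) are illustrative.

```python
import heapq


def dot(a, b):
    """Dot-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_planes(embeddings, documents, num_planes):
    """Place (embedding, document) pairs round-robin across planes.

    Storing each embedding next to its document is an assumed, simplified
    stand-in for a layout that links vectors to their associated documents.
    """
    planes = [[] for _ in range(num_planes)]
    for i, (vec, doc) in enumerate(zip(embeddings, documents)):
        planes[i % num_planes].append((vec, doc))
    return planes


def search_plane(plane, query, k):
    """Return the k best (score, document) pairs within a single plane."""
    scored = ((dot(vec, query), doc) for vec, doc in plane)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def retrieve(planes, query, k):
    """Merge per-plane candidates into a global top-k document list.

    The loop is sequential here; each plane's search is independent,
    which is what would allow it to run in parallel.
    """
    candidates = []
    for plane in planes:
        candidates.extend(search_plane(plane, query, k))
    best = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    return [doc for _, doc in best]


# Toy data: four 2-dimensional embeddings and their documents (made up).
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
documents = ["doc A", "doc B", "doc C", "doc D"]

planes = build_planes(embeddings, documents, num_planes=2)
result = retrieve(planes, query=[1.0, 0.1], k=2)
print(result)
```

Because each plane returns its own top-k candidates, the merged result matches a global top-k search over the full database. The point of the sketch is only that the per-plane searches are independent and that the documents travel with their embeddings, so no second lookup is needed after the search.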