🤖 AI Summary
Existing RAG systems suffer from high end-to-end latency and poor tail latency control, particularly for time-to-first-token (TTFT), because vector retrieval and LLM inference are optimized separately.
Method: We propose the first joint optimization framework for CPU-GPU heterogeneous systems. It leverages statistical modeling of query access skew to design a GPU-HBM memory-aware dynamic index partitioning scheme, and jointly schedules vector retrieval with LLM batching to achieve load balancing.
Contribution/Results: Our core innovation lies in co-optimizing index hotness clustering, GPU memory allocation, and LLM batch size, breaking away from conventional static partitioning. Experiments show a 2× improvement in vector search responsiveness and a significant TTFT reduction, achieving 100% SLO compliance for user-defined latency targets.
📝 Abstract
Retrieval-Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases, enabling external knowledge retrieval to support language model reasoning. While RAG enables efficient question answering with smaller LLMs, existing optimizations for vector search and LLM serving have largely been developed in isolation. As a result, their integration often leads to suboptimal end-to-end performance. ... This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism for RAG systems that improves responsiveness by jointly optimizing vector search and LLM serving across CPU and GPU systems. A key challenge is determining which indices, and how much of the vector index, should reside on the GPU, and adjusting LLM batch sizes to balance the pipeline so that it achieves lower Time-To-First-Token (TTFT) and meets user-defined Service-Level Objectives (SLOs). To address this, we leverage the insight that cluster access in vector databases is skewed: a subset of clusters is queried significantly more frequently than the others. VectorLiteRAG exploits this property through an optimized memory distribution strategy, dynamically allocating the minimum number of vector indices corresponding to frequently accessed clusters onto the GPU HBM to ensure a balanced pipeline with the LLM for high responsiveness. This adaptive partitioning scheme is guided by a statistical model that informs memory allocation and workload distribution. Our evaluation demonstrates that VectorLiteRAG improves vector search responsiveness by 2x and significantly reduces end-to-end TTFT in RAG systems by intelligently balancing memory resources between vector search and LLM execution.
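The idea of exploiting access skew can be illustrated with a small sketch. This is not VectorLiteRAG's actual algorithm or statistical model, which the abstract does not specify. It is a simple greedy heuristic under stated assumptions: per-cluster access counts have already been profiled, each cluster has a known memory footprint, and the names `Cluster`, `select_hot_clusters`, `target_hit_rate`, and `gpu_budget_bytes` are hypothetical. The sketch places the most frequently accessed clusters on the GPU until a target share of accesses is covered, so that as little HBM as possible is taken away from the LLM.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    cluster_id: int
    access_count: int  # assumed: number of profiled query probes hitting this cluster
    size_bytes: int    # assumed: memory footprint of the cluster's vectors


def select_hot_clusters(clusters, target_hit_rate, gpu_budget_bytes):
    """Greedy sketch: take clusters in descending access order until the
    covered share of accesses reaches target_hit_rate, skipping any cluster
    that would exceed the GPU memory budget.

    Returns (list of chosen cluster ids, achieved access share).
    """
    total = sum(c.access_count for c in clusters)
    if total == 0:
        return [], 0.0

    chosen, used_bytes, covered = [], 0, 0
    for c in sorted(clusters, key=lambda c: c.access_count, reverse=True):
        if covered / total >= target_hit_rate:
            break  # enough accesses are served from the GPU; stop early
        if used_bytes + c.size_bytes > gpu_budget_bytes:
            continue  # this cluster does not fit in the remaining budget
        chosen.append(c.cluster_id)
        used_bytes += c.size_bytes
        covered += c.access_count
    return chosen, covered / total


# Hypothetical skewed workload: two of five clusters receive most accesses.
profile = [
    Cluster(0, 500, 10),
    Cluster(1, 300, 10),
    Cluster(2, 100, 10),
    Cluster(3, 50, 10),
    Cluster(4, 50, 10),
]
gpu_ids, hit_rate = select_hot_clusters(profile, target_hit_rate=0.75, gpu_budget_bytes=100)
```

In a full system the target share would not be a fixed constant. Following the abstract, it would be derived from the latency SLO and co-tuned with the LLM batch size, since every byte given to the index is a byte unavailable to LLM execution.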