🤖 AI Summary
To address the trade-off between response latency and generation quality in Retrieval-Augmented Generation (RAG) systems, this paper proposes the first framework that jointly optimizes query-level scheduling and dynamic RAG configuration. The method models the latency–quality trade-off at per-query granularity, making online configuration decisions (such as the number of retrieved chunks and the answer-synthesis strategy) while orchestrating the RAG workflow in real time. The key idea is the tight coupling of query-aware scheduling with adaptive RAG configuration, which yields substantial end-to-end latency reduction without compromising generation quality. Experiments on four mainstream RAG-QA benchmarks show that the approach reduces average generation latency by 1.64×–2.54× relative to state-of-the-art methods while preserving answer quality.
📝 Abstract
Retrieval-Augmented Generation (RAG) allows large language models (LLMs) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces response delay (through better scheduling of RAG queries) or strives to maximize quality (by tuning the RAG workflow), but falls short of optimizing the trade-off between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and the synthesis method, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with state-of-the-art RAG optimization schemes, RAGServe reduces generation latency by 1.64×–2.54× without sacrificing generation quality.
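To make the idea of per-query configuration adaptation concrete, the sketch below picks, for each query's latency budget, the configuration (chunk count and synthesis method) with the best estimated quality. This is a toy illustration under assumed cost/quality models, not RAGServe's actual algorithm; all names (`Config`, `pick_config`, the two synthesis modes) and the numeric constants are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical per-query configuration space (illustrative, not RAGServe's API):
# how many chunks to retrieve, and how to synthesize the answer from them.
CHUNK_COUNTS = [2, 4, 8]
SYNTHESIS = ["stuff_all", "map_reduce"]  # one big prompt vs. summarize-then-combine

@dataclass
class Config:
    num_chunks: int
    synthesis: str

def estimate_latency(cfg: Config) -> float:
    """Toy latency model: more chunks means a longer prompt and more delay;
    map_reduce reads chunks in parallel, so it scales more gently."""
    per_chunk = 0.05 if cfg.synthesis == "map_reduce" else 0.12
    return 0.3 + per_chunk * cfg.num_chunks

def estimate_quality(cfg: Config) -> float:
    """Toy quality model: diminishing returns in the number of chunks."""
    base = {"stuff_all": 0.72, "map_reduce": 0.70}[cfg.synthesis]
    return base + 0.04 * min(cfg.num_chunks, 6)

def pick_config(latency_budget: float) -> Config:
    """Choose the highest-estimated-quality config that fits the latency budget."""
    feasible = [Config(n, s) for n, s in product(CHUNK_COUNTS, SYNTHESIS)
                if estimate_latency(Config(n, s)) <= latency_budget]
    return max(feasible, key=estimate_quality)

# A tight budget forces fewer chunks; a looser one admits richer configurations.
print(pick_config(latency_budget=0.8))
```

The point of the sketch is the shape of the decision, not the numbers: because the feasible set changes per query, the scheduler can trade retrieved context and synthesis strategy against each other online instead of fixing one global RAG configuration.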