🤖 AI Summary
To address the trade-off between response latency and generation quality in Retrieval-Augmented Generation (RAG) systems, this paper proposes the first framework that jointly optimizes query-level scheduling and dynamic RAG configuration. The method models the latency–quality trade-off at per-query granularity, making online configuration decisions (such as the number of retrieved chunks and the answer-synthesis strategy) while orchestrating the RAG workflow in real time. The key idea is the tight coupling of query-aware scheduling with adaptive RAG configuration, which yields substantial end-to-end latency reduction without compromising generation quality. Experiments on four mainstream RAG-QA benchmarks show that the approach reduces average generation latency by 1.64×–2.54× relative to state-of-the-art methods while preserving answer quality.
📝 Abstract
Retrieval-Augmented Generation (RAG) allows large language models (LLMs) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces response delay (through better scheduling of RAG queries) or strives to maximize quality (by tuning the RAG workflow), but falls short of optimizing the trade-off between the delay and quality of RAG responses. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and the synthesis method, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with state-of-the-art RAG optimization schemes, RAGServe reduces generation latency by 1.64×–2.54× without sacrificing generation quality.
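To make the idea of per-query configuration adaptation concrete, the sketch below picks, for each query's latency budget, the configuration (chunk count and synthesis method) with the best estimated quality. This is a toy illustration under assumed cost/quality models, not RAGServe's actual algorithm; all names (`Config`, `pick_config`, the two synthesis modes) and the numeric constants are hypothetical.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical per-query configuration space (illustrative, not RAGServe's API):
# how many chunks to retrieve, and how to synthesize the answer from them.
CHUNK_COUNTS = [2, 4, 8]
SYNTHESIS = ["stuff_all", "map_reduce"]  # one big prompt vs. summarize-then-combine

@dataclass
class Config:
    num_chunks: int
    synthesis: str

def estimate_latency(cfg: Config) -> float:
    """Toy latency model: more chunks means a longer prompt and more delay;
    map_reduce reads chunks in parallel, so it scales more gently."""
    per_chunk = 0.05 if cfg.synthesis == "map_reduce" else 0.12
    return 0.3 + per_chunk * cfg.num_chunks

def estimate_quality(cfg: Config) -> float:
    """Toy quality model: diminishing returns in the number of chunks."""
    base = {"stuff_all": 0.72, "map_reduce": 0.70}[cfg.synthesis]
    return base + 0.04 * min(cfg.num_chunks, 6)

def pick_config(latency_budget: float) -> Config:
    """Choose the highest-estimated-quality config that fits the latency budget."""
    feasible = [Config(n, s) for n, s in product(CHUNK_COUNTS, SYNTHESIS)
                if estimate_latency(Config(n, s)) <= latency_budget]
    return max(feasible, key=estimate_quality)

# A tight budget forces fewer chunks; a looser one admits richer configurations.
print(pick_config(latency_budget=0.8))
```

The point of the sketch is the shape of the decision, not the numbers: because the feasible set changes per query, the scheduler can trade retrieved context and synthesis strategy against each other online instead of fixing one global RAG configuration.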