AI Summary
Performance fragmentation in Retrieval-Augmented Generation (RAG) serving arises from the proliferation of algorithmic variants and the heterogeneity of their workloads. Method: This paper introduces RAGSchema, a unified abstraction that systematically characterizes the structural diversity of RAG algorithms; building on it, the authors propose RAGO, a system-level co-optimization framework supporting workload-aware scheduling, joint optimization of the LLM and retrieval modules, and low-latency token-stream scheduling. Contribution/Results: RAGO is the first adaptive inference-serving framework designed for multiple RAG variants, enabling optimization across algorithms and workloads. Evaluation on a single-chip deployment shows that RAGO achieves up to 2× higher queries per second (QPS) per chip and reduces time-to-first-token latency by 55% compared to state-of-the-art extensions of LLM serving systems.
Abstract
Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
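The abstract describes RAGSchema only at a high level: a structured abstraction that captures the configuration of a RAG pipeline so that serving systems can reason about it uniformly. As an illustrative sketch only (the field names, types, and example values below are our own assumptions, not the paper's actual schema), such an abstraction might record the model sizes, sequence lengths, and per-stage retrieval parameters that drive workload characteristics:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RetrievalStage:
    """One retrieval step in the pipeline (hypothetical fields)."""
    db_size_docs: int     # number of passages in the knowledge database
    embedding_dim: int    # dimensionality of the retriever's vectors
    top_k: int            # passages returned per query

@dataclass
class RAGSchema:
    """Illustrative RAG pipeline description (not the paper's exact schema)."""
    llm_params_b: float   # generator size in billions of parameters
    prefix_len: int       # prompt tokens fed to the LLM (query + retrieved text)
    decode_len: int       # tokens generated per query
    stages: List[RetrievalStage] = field(default_factory=list)

# Example: a single-stage pipeline with an 8B generator.
pipeline = RAGSchema(
    llm_params_b=8.0,
    prefix_len=512,
    decode_len=256,
    stages=[RetrievalStage(db_size_docs=10_000_000, embedding_dim=768, top_k=5)],
)
```

A scheduler could then compare two `RAGSchema` instances (say, one retrieval stage versus two, or a 1B versus 70B generator) to predict where the bottleneck shifts between retrieval and generation, which is the kind of cross-workload variability the paper reports.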