AI Summary
Performance fragmentation in Retrieval-Augmented Generation (RAG) serving arises from the proliferation of algorithmic variants and the heterogeneity of their workloads. Method: This paper introduces RAGSchema, a unified abstraction that systematically characterizes the structural diversity of RAG algorithms; building on it, the authors propose RAGO, a system-level co-optimization framework supporting workload-aware scheduling, joint optimization of the LLM and retrieval modules, and low-latency token-stream scheduling. Contribution/Results: RAGO is the first adaptive inference-serving framework designed for multiple RAG variants, enabling optimization across algorithms and workloads. Evaluation on a single-chip deployment shows that RAGO achieves up to 2× higher queries per second (QPS) per chip and reduces time-to-first-token latency by 55% compared to state-of-the-art extensions of LLM serving systems.
Abstract
Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2× increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
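The abstract describes RAGSchema only at a high level: a structured abstraction that captures the configuration of a RAG pipeline so that serving systems can reason about it uniformly. As an illustrative sketch only (the field names, types, and example values below are our own assumptions, not the paper's actual schema), such an abstraction might record the model sizes, sequence lengths, and per-stage retrieval parameters that drive workload characteristics:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RetrievalStage:
    """One retrieval step in the pipeline (hypothetical fields)."""
    db_size_docs: int     # number of passages in the knowledge database
    embedding_dim: int    # dimensionality of the retriever's vectors
    top_k: int            # passages returned per query

@dataclass
class RAGSchema:
    """Illustrative RAG pipeline description (not the paper's exact schema)."""
    llm_params_b: float   # generator size in billions of parameters
    prefix_len: int       # prompt tokens fed to the LLM (query + retrieved text)
    decode_len: int       # tokens generated per query
    stages: List[RetrievalStage] = field(default_factory=list)

# Example: a single-stage pipeline with an 8B generator.
pipeline = RAGSchema(
    llm_params_b=8.0,
    prefix_len=512,
    decode_len=256,
    stages=[RetrievalStage(db_size_docs=10_000_000, embedding_dim=768, top_k=5)],
)
```

A scheduler could then compare two `RAGSchema` instances (say, one retrieval stage versus two, or a 1B versus 70B generator) to predict where the bottleneck shifts between retrieval and generation, which is the kind of cross-workload variability the paper reports.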