HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse

📅 2025-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Retrieval-augmented generation (RAG) re-rankers improve answer quality but incur substantial computational overhead, severely limiting system throughput and increasing latency. To address this, we propose the first KV cache reuse mechanism specifically designed for the RAG re-ranking stage, coupled with a decoder-only lightweight re-ranker and system-level co-optimizations—including batched inference, memory reuse, and fine-grained computation scheduling. Our core innovation lies in cross-stage KV cache sharing across retrieval, re-ranking, and generation phases, enabling reuse of document-side KV representations and eliminating redundant encoding. Experiments demonstrate that our approach achieves 2–3× end-to-end throughput improvement and significantly reduces P99 latency, without compromising re-ranking accuracy or downstream task performance. This work establishes a new paradigm for deploying high-quality, high-efficiency RAG systems.
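The document-side KV reuse described above can be sketched as a toy in plain Python. This is an illustrative simulation, not the paper's implementation: the `encode` function is a stand-in for the expensive per-token forward pass, and all names (`kv_cache`, `rerank`, `doc_kv`) are assumptions for the example.

```python
import numpy as np

DIM = 16  # toy hidden dimension

def encode(text):
    # Stand-in for the expensive prefill pass that produces
    # key/value tensors in a decoder-only reranker.
    # Deterministic per input so repeated calls are comparable.
    seed = sum(text.encode()) % (2**32)
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(text.split()), DIM))

kv_cache = {}  # doc_id -> precomputed document-side "KV" tensor

def doc_kv(doc_id, text):
    # Reuse the document's KV representation if already computed,
    # instead of re-encoding the document for every query.
    if doc_id not in kv_cache:
        kv_cache[doc_id] = encode(text)
    return kv_cache[doc_id]

def rerank(query, docs):
    # Score each candidate: only the (short) query is encoded fresh;
    # the (long) document side comes from the cache.
    q = encode(query).mean(axis=0)
    scores = {d_id: float(doc_kv(d_id, text).mean(axis=0) @ q)
              for d_id, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

In a real system the cached entries would be the transformer's per-layer key/value tensors kept in GPU (or tiered) memory, and the reuse spans retrieval, re-ranking, and generation rather than a single dict.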

📝 Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs) by integrating external knowledge into the generation process. A key component of RAG pipelines is the reranker, which selects the most relevant documents from a pool of retrieved candidates and significantly improves the quality of the generated responses. While rerankers refine the selection of retrieved documents in RAG pipelines, they introduce computational challenges that hinder high throughput and low latency. To address this problem, we propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference. By reusing document-side KV-cache, HyperRAG achieves both high-quality generation and system-level efficiency. To fully realize the benefits of KV-cache reuse, HyperRAG incorporates a range of system-level optimizations designed to enhance efficiency and scalability. Experiments show that HyperRAG achieves a 2–3× throughput improvement with decoder-only rerankers while also delivering higher downstream performance compared with traditional RAG services.
Problem

Research questions and friction points this paper is trying to address.

Optimizing quality-efficiency tradeoffs in RAG pipelines
Reducing computational challenges from reranker usage
Improving throughput and latency in RAG systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

KV-cache reuse for efficient reranker inference
System-level optimizations for enhanced efficiency
Decoder-only rerankers improve throughput and performance
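The throughput gain from KV-cache reuse can be made concrete with a back-of-envelope prefill cost model. The numbers below (query length, document length, candidate pool sizes) are illustrative assumptions, not figures from the paper; the point is only that when many queries share candidate documents, caching shifts the document-side cost from per-pair to per-unique-document.

```python
def prefill_tokens(n_queries, docs_per_query, doc_len, query_len,
                   n_unique_docs, reuse_doc_kv):
    # Query tokens are always encoded per (query, document) pair.
    query_side = n_queries * docs_per_query * query_len
    if reuse_doc_kv:
        # Each unique document's KV is computed once and shared.
        doc_side = n_unique_docs * doc_len
    else:
        # Without reuse, every pair re-encodes the full document.
        doc_side = n_queries * docs_per_query * doc_len
    return query_side + doc_side

# Assumed workload: 64 queries, 20 candidates each drawn from a
# pool of 400 unique documents, 512-token docs, 32-token queries.
baseline = prefill_tokens(64, 20, 512, 32, 400, reuse_doc_kv=False)
hyper    = prefill_tokens(64, 20, 512, 32, 400, reuse_doc_kv=True)
print(f"speedup ~{baseline / hyper:.1f}x")  # → speedup ~2.8x
```

Under these assumptions the prefill work drops by roughly 2.8×, consistent with the 2–3× end-to-end range reported above; the exact factor depends on document overlap across queries and the query-to-document length ratio.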