🤖 AI Summary
To address hardware idling, high latency, and low resource utilization caused by sequential retrieval and generation in RAG systems under single-GPU resource constraints, this paper proposes a decoupled, parallel retrieval-generation pipeline architecture. The method introduces: (1) a joint memory placement and dynamic batch scheduling strategy for adaptive load balancing across heterogeneous CPU–GPU devices; and (2) a lightweight knowledge routing mechanism, hierarchical cache management, and dynamic retrieval batching, implemented as an extension of vLLM. Experiments demonstrate up to 3.6× speedup in average latency on a single GPU, while supporting LLMs of multiple scales and large-scale knowledge bases. The approach significantly improves system throughput and GPU/CPU resource utilization without compromising accuracy or scalability.
📝 Abstract
Retrieval-Augmented Generation (RAG) enhances large language model (LLM) generation quality by incorporating relevant external knowledge. However, deploying RAG on consumer-grade platforms is challenging due to limited memory and the increasing scale of both models and knowledge bases. In this work, we introduce RAGDoll, a resource-efficient, self-adaptive RAG serving system integrated with LLMs, specifically designed for resource-constrained platforms. RAGDoll exploits the insight that RAG retrieval and LLM generation impose different computational and memory demands, which in a traditional serial workflow result in substantial idle times and poor resource utilization. Based on this insight, RAGDoll decouples retrieval and generation into parallel pipelines, incorporating joint memory placement and dynamic batch scheduling strategies to optimize resource usage across diverse hardware devices and workloads. Extensive experiments demonstrate that RAGDoll adapts effectively to various hardware configurations and LLM scales, achieving up to 3.6 times speedup in average latency compared to serial RAG systems based on vLLM.
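The core idea of decoupling retrieval and generation into parallel pipelines can be illustrated with a minimal sketch. This is not RAGDoll's implementation: `retrieve` and `generate` are hypothetical stand-ins for the CPU-bound vector search and GPU-bound LLM decoding stages, and the bounded queue plays the role of the staging buffer between them, so retrieval for later requests overlaps with generation for earlier ones instead of running serially.

```python
import queue
import threading

# Hypothetical in-memory knowledge base; a real system would query a vector index.
KNOWLEDGE = {"q1": "doc-a", "q2": "doc-b", "q3": "doc-c"}

def retrieve(query):
    # Stand-in for the CPU-bound retrieval stage.
    return KNOWLEDGE.get(query, "")

def generate(query, context):
    # Stand-in for the GPU-bound LLM generation stage.
    return f"answer({query}|{context})"

def serve(queries, batch_size=2):
    """Decoupled pipeline: a retriever thread feeds a bounded queue
    while the generator drains it in batches (a crude analogue of
    dynamic batch scheduling)."""
    staged = queue.Queue(maxsize=8)   # staging buffer between the two stages
    DONE = object()                   # sentinel marking end of the request stream

    def retriever():
        for q in queries:
            staged.put((q, retrieve(q)))  # overlaps with ongoing generation
        staged.put(DONE)

    threading.Thread(target=retriever, daemon=True).start()

    results, batch = [], []
    while True:
        item = staged.get()
        if item is DONE:
            if batch:  # flush the final partial batch
                results.extend(generate(q, c) for q, c in batch)
            break
        batch.append(item)
        if len(batch) >= batch_size:
            results.extend(generate(q, c) for q, c in batch)
            batch = []
    return results
```

In a real deployment the batch size would be chosen dynamically from observed queue depth and device load, which is where a joint memory placement and scheduling policy comes into play.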