🤖 AI Summary
This work addresses key challenges in deploying large language models for retrieval-augmented generation (RAG): high computational overhead, rapid knowledge obsolescence, and reliance on manual component selection. The authors propose a modular evaluation framework that, for the first time, directly links hardware constraints to RAG performance. By combining resource telemetry with an automated recommendation mechanism, the framework identifies effective combinations of components (document chunking strategies, embedding models, vector databases, and retrievers) for domain-specific datasets. The approach aims to maintain high generation quality while reducing resource consumption. Designed to support rapid prototyping on consumer-grade hardware, the framework enables automatic, domain-tailored RAG configuration that balances accuracy, efficiency, and scalability.
📝 Abstract
Deploying Large Language Model (LLM) applications, particularly those relying on Retrieval-Augmented Generation (RAG), remains challenging due to high computational demands, outdated knowledge bases, and the need to manually select optimal pipeline components. In this work, we propose RAGe, a modular framework for benchmarking RAG applications and guiding their efficient development. It combines resource telemetry with component recommendation to suggest the best components for a domain-specific dataset. Our approach covers core techniques in LLM applications, including document chunking, vector databases, embedding models, and retrievers, and evaluates the trade-offs among accuracy, efficiency, and scalability. By directly correlating retrieval and generation quality with underlying hardware constraints, RAGe helps researchers identify the most effective RAG setup for their domain and operational needs, facilitating rapid prototyping even on consumer-grade hardware.
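
To make the idea of pairing component choices with resource telemetry concrete, the following is a minimal sketch and not the framework's actual implementation. It assumes a hypothetical `evaluate` callback that builds and scores one pipeline configuration, enumerates every combination of component options, and records wall-clock time and peak Python heap usage for each trial. All names here (`TrialResult`, `run_grid`, `recommend`) are illustrative assumptions, and `tracemalloc` tracks only Python allocations, so GPU or native-library memory would need a different probe.

```python
import itertools
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence


@dataclass
class TrialResult:
    config: Dict[str, str]  # one choice per component slot
    quality: float          # score returned by the evaluate callback (higher is better)
    seconds: float          # wall-clock time of the evaluation
    peak_bytes: int         # peak Python heap usage reported by tracemalloc


def run_grid(
    grid: Dict[str, Sequence[str]],
    evaluate: Callable[[Dict[str, str]], float],
) -> List[TrialResult]:
    """Evaluate every combination of component options and record telemetry."""
    names = list(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        tracemalloc.start()
        start = time.perf_counter()
        quality = evaluate(config)  # hypothetical: build and score the pipeline
        seconds = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        results.append(TrialResult(config, quality, seconds, peak))
    return results


def recommend(
    results: Sequence[TrialResult],
    max_seconds: float,
    max_peak_bytes: int,
) -> Optional[TrialResult]:
    """Return the highest-quality trial that fits the resource budget, if any."""
    feasible = [
        r for r in results
        if r.seconds <= max_seconds and r.peak_bytes <= max_peak_bytes
    ]
    return max(feasible, key=lambda r: r.quality) if feasible else None
```

Filtering by budget before ranking by quality is one simple way to tie a recommendation to hardware constraints; a weighted score or a Pareto front over quality, time, and memory would be equally plausible alternatives.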