Less LLM, More Documents: Searching for Improved RAG

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of improving retrieval-augmented generation (RAG) accuracy and cost-efficiency without scaling up language model (LM) parameters. We propose the “corpus–generator trade-off” principle, systematically demonstrating that expanding high-quality retrieval corpora significantly reduces reliance on large-parameter LMs. Through rigorous experimentation, we show that a medium-scale LM (e.g., 7B) augmented with a quadrupled corpus achieves performance on multiple open-domain QA benchmarks comparable to that of a much larger LM (e.g., 70B). The primary driver of improvement is increased coverage of answer-relevant passages; however, gains exhibit diminishing returns and saturate as corpus size grows. Crucially, this study provides the first quantitative characterization of corpus expansion as a viable alternative to model scaling—establishing its effectiveness, practical limits, and scalability properties. Our findings establish a new paradigm for lightweight, cost-effective RAG deployment grounded in corpus optimization rather than LM parameter growth.
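The summary attributes the gains to increased coverage of answer-relevant passages. A minimal sketch of one common proxy for such coverage (exact-match containment of the gold answer in any retrieved passage; the function name and toy data below are illustrative, not from the paper):

```python
def answer_coverage(retrieved, answers):
    """Fraction of questions whose retrieved passages contain the gold answer.

    retrieved: list of lists of passage strings (one list per question)
    answers:   list of gold answer strings (one per question)
    """
    if not answers:
        return 0.0
    hits = sum(
        any(ans.lower() in passage.lower() for passage in passages)
        for passages, ans in zip(retrieved, answers)
    )
    return hits / len(answers)

# Toy illustration: a larger corpus surfaces more candidate passages,
# raising the chance that at least one contains the answer.
small = [["Paris is in France."],
         ["The sky appears blue."]]
large = [["Paris is in France.", "Paris is the capital of France."],
         ["The sky appears blue.", "Rayleigh scattering makes the sky blue."]]
answers = ["capital of France", "Rayleigh"]

print(answer_coverage(small, answers))  # -> 0.0
print(answer_coverage(large, answers))  # -> 1.0
```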

📝 Abstract
Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators improves accuracy, it also raises cost and limits deployability. We explore an orthogonal axis: enlarging the retriever's corpus to reduce reliance on large LLMs. Experimental results show that corpus scaling consistently strengthens RAG and can often serve as a substitute for increasing model size, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and large models benefit less. Our analysis shows that improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. These findings establish a principled corpus-generator trade-off: investing in larger corpora offers an effective path to stronger RAG, often comparable to enlarging the LLM itself.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM reliance through corpus scaling
Exploring corpus-generator trade-off for RAG efficiency
Improving retrieval coverage to substitute model size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enlarging retriever corpus to reduce LLM reliance
Corpus scaling substitutes for increasing model size
Larger corpora offer comparable gains to enlarging LLM
Jingjie Ning
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Yibo Kong
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Yunfan Long
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Jamie Callan
Professor, Language Technologies Institute, Carnegie Mellon University
Information retrieval · text mining