Less LLM, More Documents: Searching for Improved RAG

📅 2025-10-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of improving retrieval-augmented generation (RAG) accuracy and cost-efficiency without scaling up language model (LM) parameters. We propose the “corpus–generator trade-off” principle, systematically demonstrating that expanding high-quality retrieval corpora significantly reduces reliance on large-parameter LMs. Through rigorous experimentation, we show that a medium-scale LM (e.g., 7B) augmented with a quadrupled corpus achieves performance on multiple open-domain QA benchmarks comparable to that of a much larger LM (e.g., 70B). The primary driver of improvement is increased coverage of answer-relevant passages; however, gains exhibit diminishing returns and saturate as corpus size grows. Crucially, this study provides the first quantitative characterization of corpus expansion as a viable alternative to model scaling—establishing its effectiveness, practical limits, and scalability properties. Our findings establish a new paradigm for lightweight, cost-effective RAG deployment grounded in corpus optimization rather than LM parameter growth.
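The summary attributes the gains to increased coverage of answer-relevant passages. A minimal sketch of one common proxy for such coverage (exact-match containment of the gold answer in any retrieved passage; the function name and toy data below are illustrative, not from the paper):

```python
def answer_coverage(retrieved, answers):
    """Fraction of questions whose retrieved passages contain the gold answer.

    retrieved: list of lists of passage strings (one list per question)
    answers:   list of gold answer strings (one per question)
    """
    if not answers:
        return 0.0
    hits = sum(
        any(ans.lower() in passage.lower() for passage in passages)
        for passages, ans in zip(retrieved, answers)
    )
    return hits / len(answers)

# Toy illustration: a larger corpus surfaces more candidate passages,
# raising the chance that at least one contains the answer.
small = [["Paris is in France."],
         ["The sky appears blue."]]
large = [["Paris is in France.", "Paris is the capital of France."],
         ["The sky appears blue.", "Rayleigh scattering makes the sky blue."]]
answers = ["capital of France", "Rayleigh"]

print(answer_coverage(small, answers))  # -> 0.0
print(answer_coverage(large, answers))  # -> 1.0
```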

📝 Abstract
Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators improves accuracy, it also raises cost and limits deployability. We explore an orthogonal axis: enlarging the retriever's corpus to reduce reliance on large LLMs. Experimental results show that corpus scaling consistently strengthens RAG and can often serve as a substitute for increasing model size, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and large models benefit less. Our analysis shows that improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. These findings establish a principled corpus-generator trade-off: investing in larger corpora offers an effective path to stronger RAG, often comparable to enlarging the LLM itself.
Problem

Research questions and friction points this paper is trying to address.

Reducing LLM reliance through corpus scaling
Exploring corpus-generator trade-off for RAG efficiency
Improving retrieval coverage to substitute model size
Innovation

Methods, ideas, or system contributions that make the work stand out.

Enlarging retriever corpus to reduce LLM reliance
Corpus scaling substitutes for increasing model size
Larger corpora offer comparable gains to enlarging LLM
Jingjie Ning
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Yibo Kong
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Yunfan Long
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
Jamie Callan
Professor, Language Technologies Institute, Carnegie Mellon University
Information retrieval · text mining