Evaluating the Retrieval Robustness of Large Language Models

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work systematically evaluates the robustness of large language models (LLMs) in retrieval-augmented generation (RAG), addressing three questions: whether RAG consistently outperforms non-RAG baselines, whether increasing the retrieved document count always improves performance, and whether document ordering affects generation quality. Method: the authors introduce the concept of *retrieval robustness*, construct a benchmark of 1,500 open-domain questions with Wikipedia-based retrieval, and propose three quantitative metrics, one per research question. Controlled experiments span 11 LLMs and 3 prompting strategies. Results: all models exhibit surprisingly high retrieval robustness, yet varying degrees of imperfect robustness, such as sensitivity to retrieval noise and document ordering, keep them from fully realizing RAG's benefits.

📝 Abstract
Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, different degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
Problem

Research questions and friction points this paper is trying to address.

Assessing if RAG consistently outperforms non-RAG approaches
Examining if more retrieved documents guarantee better performance
Investigating how document order affects retrieval-augmented results
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates LLM robustness in RAG setups
Introduces three retrieval robustness metrics
Tests 11 LLMs with 3 prompting strategies
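The paper does not reproduce its metric definitions here, but one can sketch what per-question robustness scores for the three research questions might look like. The sketch below is hypothetical: the function names, the win-rate/monotonicity/order-spread formulations, and all inputs are assumptions for illustration, not the paper's actual metrics.

```python
# Hypothetical robustness scores for the three research questions.
# NOT the paper's metric definitions; a minimal illustrative sketch.

def rag_win_rate(rag_correct, base_correct):
    """RQ1: fraction of questions where RAG does no worse than non-RAG.
    Inputs are parallel lists of 0/1 correctness flags per question."""
    assert len(rag_correct) == len(base_correct)
    wins = sum(r >= b for r, b in zip(rag_correct, base_correct))
    return wins / len(rag_correct)

def monotonicity_rate(acc_by_k):
    """RQ2: fraction of adjacent document-count steps where adding more
    retrieved documents does not hurt accuracy (acc_by_k: k -> accuracy)."""
    ks = sorted(acc_by_k)
    steps = list(zip(ks, ks[1:]))
    return sum(acc_by_k[b] >= acc_by_k[a] for a, b in steps) / len(steps)

def order_stability(accs_per_ordering):
    """RQ3: one minus the accuracy spread across document orderings;
    1.0 means ordering has no effect at all."""
    return 1.0 - (max(accs_per_ordering) - min(accs_per_ordering))
```

For example, `rag_win_rate([1, 1, 0, 1], [1, 0, 1, 1])` gives 0.75: RAG matches or beats the non-RAG baseline on three of four questions. Scores near 1.0 on all three functions would indicate the kind of high-but-imperfect retrieval robustness the abstract describes.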
Shuyang Cao
University of Michigan
Computational Linguistics
Karthik Radhakrishnan
Bloomberg
David Rosenberg
Bloomberg
Steven Lu
Bloomberg
Pengxiang Cheng
Bloomberg LP
Natural Language Processing, Computational Linguistics
Lu Wang
University of Michigan
Shiyue Zhang
Bloomberg