🤖 AI Summary
This paper identifies a critical robustness deficiency in retrieval-augmented generation (RAG) systems at the query level: mainstream systems exhibit high sensitivity to minor query perturbations—including misspellings, synonym substitutions, and syntactic transformations—leading to substantial degradation in both retrieval and generation performance. To address this, the authors introduce the first systematic evaluation framework specifically designed for query-level robustness, decoupling and quantifying the sensitivity of retrievers, rerankers, and generators. They conduct 1,092 end-to-end experiments across 10 diverse general-domain and domain-specific datasets. Empirical analysis reveals that the retriever constitutes the primary robustness bottleneck, whereas reranking significantly mitigates perturbation effects. Grounded in these findings, the paper proposes three principled design guidelines—query normalization, joint retriever-reranker optimization, and perturbation-aware fine-tuning—and provides reproducible implementation recommendations. This work establishes both theoretical insights and practical foundations for developing robust RAG systems.
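The three perturbation types named above (misspellings, synonym substitutions, and syntactic transformations) can be sketched as simple string-level operations. The following is a minimal illustrative sketch, not the paper's actual perturbation code; all function names and the synonym-table approach are assumptions for illustration:

```python
import random

def misspell(query: str, seed: int = 0) -> str:
    """Typo perturbation: swap two adjacent characters in a random longer word."""
    rng = random.Random(seed)  # seeded for reproducibility
    words = query.split()
    candidates = [i for i, w in enumerate(words) if len(w) > 3]
    if not candidates:
        return query
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(len(w) - 1)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def substitute_synonym(query: str, synonyms: dict) -> str:
    """Synonym perturbation: replace words found in a lookup table."""
    return " ".join(synonyms.get(w.lower(), w) for w in query.split())

def to_question(query: str) -> str:
    """Syntactic perturbation: rephrase a keyword query as a question."""
    return f"Can you tell me {query[0].lower() + query[1:].rstrip('?')}?"
```

Perturbations like these leave the query's intent intact, which is what makes the reported performance drops a robustness failure rather than a harder task.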
📝 Abstract
Large language models (LLMs) are costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges, most notably a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question-answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of the 1,092 experiments we performed.
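The core measurement behind such an evaluation framework is comparing what a retriever returns for an original query versus its perturbed variant. Below is a minimal sketch of that idea using a toy token-overlap retriever; the paper's framework targets real retrievers and rerankers, and the function names, corpus, and overlap metric here are illustrative assumptions:

```python
def retrieve(query: str, corpus: dict, k: int = 3) -> list:
    """Toy lexical retriever: rank documents by token overlap with the query."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_tokens & set(kv[1].lower().split())),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

def robustness(query: str, perturbed: str, corpus: dict, k: int = 3) -> float:
    """Fraction of the top-k results preserved when the query is perturbed."""
    orig = set(retrieve(query, corpus, k))
    pert = set(retrieve(perturbed, corpus, k))
    return len(orig & pert) / k
```

Running this per pipeline stage (retriever alone, retriever plus reranker, full QA) is what lets the sensitivity of each module be decoupled, as the paper does.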