🤖 AI Summary
This paper identifies a systemic safety degradation effect of Retrieval-Augmented Generation (RAG) on large language models (LLMs), a phenomenon that persists even when both the LLM and the retrieved documents are safety-aligned. Through multi-dimensional safety evaluations of 11 mainstream LLMs under RAG and non-RAG configurations, the study empirically demonstrates that RAG increases the average probability of harmful outputs and shifts the distribution of risk categories. Moreover, standard red-teaming methods are on average 37% less effective in RAG settings, revealing that existing safety techniques fail to transfer. The contributions are threefold: (1) the first systematic empirical evidence that RAG can *undermine*, rather than enhance, LLM safety; (2) identification of a RAG-specific vulnerability in which "safe model + safe context ≠ safe output"; and (3) a call for, and a foundational rationale toward, RAG-native safety evaluation paradigms.
📝 Abstract
Efforts to ensure the safety of large language models (LLMs) include safety fine-tuning, evaluation, and red teaming. However, despite the widespread use of the Retrieval-Augmented Generation (RAG) framework, AI safety work has focused on standard LLMs, so little is known about how RAG use cases change a model's safety profile. We conduct a detailed comparative analysis of RAG and non-RAG frameworks with eleven LLMs. We find that RAG can make models less safe and change their safety profile. We explore the causes of this change and find that even combinations of safe models with safe documents can produce unsafe generations. In addition, we evaluate several existing red-teaming methods in RAG settings and show that they are less effective there than in non-RAG settings. Our work highlights the need for safety research and red-teaming methods specifically tailored to RAG LLMs.