RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the context robustness of external safety classifiers—such as Llama Guard and GPT-oss—when deployed in retrieval-augmented generation (RAG) settings, specifically examining how injected retrieval documents interfere with safety assessments. Through systematic evaluation, the authors find that embedding benign retrieved documents in the guardrail context alters safety judgments for inputs and outputs in approximately 11% and 8% of cases, respectively, substantially degrading classifier reliability. Attribution analysis reveals distinct, component-specific contributions to misclassification from retrieved documents, user queries, and model-generated responses. Two proposed mitigation strategies yield only marginal improvements, indicating fundamental limitations in current training and evaluation paradigms. To the authors' knowledge, this is the first study to quantitatively characterize the contextual robustness threat posed by RAG to LLM safety classifiers. The findings provide critical empirical evidence and concrete directions for developing context-aware safety mechanisms.

📝 Abstract
With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution for screening unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs and thus vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigate how robust LLM-based guardrails are to additional information embedded in their context. Through a systematic evaluation of three Llama Guard and two GPT-oss models, we confirm that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyze the effect of each component of the augmented context: the retrieved documents, the user query, and the LLM-generated response. The two mitigation methods we tested bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
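The evaluation the abstract describes can be sketched as follows: judge each (query, response) pair with and without benign retrieved documents in the guardrail context, and count how often the verdict flips. This is a minimal illustration, not the paper's harness; `judge` is a hypothetical stand-in for an LLM guardrail such as Llama Guard, and the context template is an assumption.

```python
def build_context(query, response, documents=None):
    """Assemble the guardrail input, optionally prepending retrieved documents
    (the RAG-style augmentation the paper studies)."""
    parts = []
    if documents:
        parts.append("Retrieved documents:\n" + "\n".join(documents))
    parts.append(f"User: {query}")
    parts.append(f"Assistant: {response}")
    return "\n\n".join(parts)


def flip_rate(cases, judge):
    """Fraction of cases where adding benign documents to the context
    changes the guardrail's safety verdict."""
    flips = 0
    for query, response, documents in cases:
        plain = judge(build_context(query, response))
        augmented = judge(build_context(query, response, documents))
        flips += plain != augmented
    return flips / len(cases)
```

The same scaffold supports the component-wise attribution the abstract mentions: rebuild the context with only a subset of {documents, query, response} varied and compare flip rates per component.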
Problem

Research questions and friction points this paper is trying to address.

Investigating robustness of LLM guardrails under RAG-style contexts
Testing how additional context alters guardrail safety judgments
Exposing context-robustness gap in current guardrail models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated guardrail robustness under RAG contexts
Tested two mitigation methods, which bring only minor improvements
Motivated training and evaluation protocols robust to retrieval and query composition