🤖 AI Summary
This study addresses the trade-off between privacy preservation and model utility in Retrieval-Augmented Generation (RAG) systems that handle personally identifiable information (PII). It presents a systematic evaluation of how applying anonymization at two different stages of the RAG pipeline, the underlying dataset versus the generated answer, affects both privacy protection and task performance. Through quantitative analysis, the research demonstrates that the placement of anonymization significantly influences the privacy-utility balance: anonymizing at the dataset stage offers stronger privacy guarantees, whereas anonymizing the generated answer better preserves the quality of the generated text. These findings provide empirical evidence and practical design guidance for mitigating privacy risks in RAG systems without unduly compromising their effectiveness.
📝 Abstract
Despite the considerable promise of Retrieval-Augmented Generation (RAG), many real-world use cases raise privacy concerns: the purported utility of RAG-enabled insights comes at the risk of exposing private information to either the LLM or the end user requesting the response. As a potential mitigation, using anonymization techniques to remove personally identifiable information (PII) and other sensitive markers from the underlying data is a practical and sensible course of action for RAG administrators. Despite a wealth of literature on the topic, no prior work considers the placement of anonymization along the RAG pipeline, that is, the question of where anonymization should happen. In this case study, we systematically and empirically measure the impact of anonymization at two important points along the RAG pipeline: the dataset and the generated answer. We show that differences in privacy-utility trade-offs can be observed depending on where anonymization takes place, demonstrating the significance of privacy risk mitigation placement in RAG.
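To make the two placement points concrete, the following is a minimal Python sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the regex-based `anonymize` function, the word-overlap `retrieve` function, the `generate` stub standing in for an LLM call, and the example documents are all hypothetical.

```python
import re

# Toy PII patterns (assumption): a real deployment would use a dedicated PII
# detector. Names such as "Alice" are NOT caught by these two regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text):
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))


def retrieve(query, docs, k=1):
    """Hypothetical retriever: rank documents by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return ranked[:k]


def generate(query, context):
    """Stand-in for an LLM call: simply echoes the retrieved context."""
    return "Answer based on: " + " ".join(context)


def rag_anonymize_dataset(query, docs):
    """Placement 1: anonymize the dataset, so the generator never sees raw PII."""
    context = retrieve(query, [anonymize(d) for d in docs])
    return context, generate(query, context)


def rag_anonymize_answer(query, docs):
    """Placement 2: the generator sees raw data; only the answer is anonymized."""
    context = retrieve(query, docs)
    return context, anonymize(generate(query, context))


docs = [
    "Contact Alice at alice@example.com about the invoice.",
    "The office closes at 5pm.",
]
query = "how to contact alice about invoice"

ctx_a, ans_a = rag_anonymize_dataset(query, docs)  # generator input is redacted
ctx_b, ans_b = rag_anonymize_answer(query, docs)   # generator input is raw
```

In this toy case both placements return the same redacted answer to the end user. The difference is in what reaches the generator: `ctx_a` contains only the placeholder, while `ctx_b` still contains the raw e-mail address. This mirrors the abstract's distinction between exposing private information to the LLM and exposing it to the end user. With a real LLM the two answers would generally differ, and that is where a privacy-utility trade-off could arise.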