🤖 AI Summary
Retrieval-augmented generation (RAG) systems risk leaking sensitive information during external knowledge retrieval. Method: This paper presents the first rigorous integration of differential privacy (DP) into the end-to-end RAG generation pipeline, proposing an adaptive privacy budget allocation strategy: DP noise is injected *only* into tokens whose generation depends on sensitive contextual information, while all other tokens are generated losslessly by the base large language model, yielding a hybrid privacy-preserving generation architecture. Contribution/Results: This approach overcomes the accuracy bottleneck inherent in DP-based long-text generation. Under a practical privacy budget (ε ≈ 10), it significantly outperforms non-RAG baselines across multiple LLMs and datasets, achieving high fidelity and strong privacy guarantees simultaneously. It constitutes the first solution for trustworthy RAG deployment in sensitive scenarios that is both theoretically rigorous and practically viable.
📄 Abstract
With the recent remarkable advancement of large language models (LLMs), there has been growing interest in applying them to domains with highly sensitive data that lie outside their training data. For this purpose, retrieval-augmented generation (RAG) is particularly effective -- it assists LLMs by directly providing relevant information from external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is generating long, accurate answers within a moderate privacy budget. We address this by proposing an algorithm that spends the privacy budget only on tokens that require the sensitive information and uses the non-private LLM for all other tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $\epsilon \approx 10$ across different models and datasets.
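The abstract's core idea, spending privacy budget only on tokens that need the retrieved sensitive context, can be illustrated with a toy decoding loop. The sketch below is *not* the paper's actual algorithm: the choice of the exponential mechanism for private token selection, the uniform per-token budget split, and the `needs_context`, `base_next_scores`, and `private_next_scores` callables are all hypothetical placeholders introduced here for illustration.

```python
import math
import random

def exp_mechanism(scores, eps, sensitivity=1.0, rng=random):
    """Exponential mechanism: sample token t with prob ∝ exp(eps * score(t) / (2 * sensitivity))."""
    m = max(scores.values())  # subtract max for numerical stability
    weights = {t: math.exp(eps * (s - m) / (2.0 * sensitivity)) for t, s in scores.items()}
    r = rng.random() * sum(weights.values())
    acc = 0.0
    for t, w in weights.items():
        acc += w
        if r <= acc:
            return t
    return t  # floating-point fallback

def hybrid_generate(base_next_scores, private_next_scores, needs_context,
                    total_eps, max_tokens):
    """Toy hybrid decoder (hypothetical interface, not the paper's method).

    base_next_scores(prefix)    -> {token: score} from the non-private base LLM
    private_next_scores(prefix) -> {token: score} derived from sensitive retrieved docs
    needs_context(prefix)       -> True iff the next token depends on sensitive context
    """
    per_token_eps = total_eps / max_tokens  # naive uniform budget split (assumption)
    spent = 0.0
    out = []
    for _ in range(max_tokens):
        prefix = tuple(out)
        if needs_context(prefix) and spent + per_token_eps <= total_eps:
            # Privacy-sensitive token: pay budget, sample privately.
            tok = exp_mechanism(private_next_scores(prefix), per_token_eps)
            spent += per_token_eps
        else:
            # Ordinary token: greedy decode from the base LLM, zero budget cost.
            scores = base_next_scores(prefix)
            tok = max(scores, key=scores.get)
        if tok == "<eos>":
            break
        out.append(tok)
    return out, spent
```

The design point the sketch makes concrete is the accounting: only the branch that reads the retrieved documents consumes ε, so a long answer whose tokens are mostly "ordinary" stays within a moderate total budget.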