Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation

📅 2025-04-27
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Addressing the dual challenges of domain data scarcity and privacy leakage in private Retrieval-Augmented Generation (RAG) systems, this paper proposes FedE4RAG, a novel end-to-end privacy-preserving framework for training retrieval models. FedE4RAG integrates federated learning, knowledge distillation, and homomorphic encryption to enable clients to collaboratively optimize embedded retrieval models locally, ensuring that raw data and gradients never leave their respective domains. It introduces the first application of knowledge distillation in federated RAG to enhance local model generalization, while leveraging homomorphic encryption to provide cryptographic-strength privacy guarantees at the parameter level. Extensive experiments on real-world datasets demonstrate that FedE4RAG significantly improves retrieval accuracy in private RAG settings, while rigorously preventing leakage of both sensitive data and model parameters.
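The distillation step mentioned above can be illustrated with a minimal sketch. It assumes the global (server) retriever acts as the teacher and the local (client) retriever as the student, and that both score the same candidate passages for a query; neither assumption is confirmed by the summary. The function names, the temperature value, and the example scores are hypothetical.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw relevance scores into a probability distribution."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    """KL(teacher || student) over temperature-softened score distributions.

    Assumption: the teacher is the global model and the student is the
    local model; the paper's actual loss may differ.
    """
    teacher = softmax(teacher_scores, temperature)
    student = softmax(student_scores, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return kl * temperature ** 2  # conventional rescaling by T^2

# Hypothetical similarity scores for three candidate passages.
teacher_scores = [2.0, 0.5, -1.0]
student_scores = [1.0, 1.0, 0.0]
loss = distillation_loss(student_scores, teacher_scores)
```

The loss is zero when the student reproduces the teacher's ranking distribution and grows as the two diverge, which is the property a client could use as a regularizer during local training.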

πŸ“ Abstract
Retrieval-Augmented Generation (RAG) has recently emerged as a promising way to improve the accuracy and credibility of Large Language Models (LLMs), particularly in question-answering tasks, by incorporating proprietary and private data from integrated databases. However, private RAG systems face two significant challenges: private domain data is scarce, and data privacy is a critical concern. These obstacles impede deployment, because a privacy-preserving RAG system must strike a delicate balance between data security and data availability. To address these challenges, we regard federated learning (FL) as a highly promising technology for privacy-preserving RAG services and propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG). The framework facilitates collaborative training of client-side RAG retrieval models. The parameters of these models are aggregated and redistributed by a central server, which protects data privacy because raw data is never shared directly. In FedE4RAG, knowledge distillation is employed for communication between the server and client models, which improves the generalization of local RAG retrievers during federated learning. Additionally, we apply homomorphic encryption within federated learning to safeguard model parameters and mitigate concerns about data leakage. Extensive experiments on real-world data validate the effectiveness of FedE4RAG. The results demonstrate that the proposed framework can markedly enhance the performance of private RAG systems while maintaining robust data privacy protection.
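The server-side aggregation described in the abstract can be sketched as a weighted parameter average in the style of FedAvg. This is an assumption: the abstract does not state the exact aggregation rule. The function name, the flat parameter lists, and the client dataset sizes are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    Assumption: a FedAvg-style rule; the paper may aggregate differently.
    client_weights: list of equal-length lists of floats, one per client.
    client_sizes:   number of local training examples per client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]

# Two hypothetical clients, each holding a two-parameter retriever.
global_weights = federated_average(
    client_weights=[[1.0, 2.0], [3.0, 4.0]],
    client_sizes=[1, 3],
)
```

Only parameter vectors reach the server in this sketch; the raw documents used for local training stay on each client.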

Problem

Research questions and friction points this paper is trying to address.

Addresses data scarcity in private RAG systems
Balances data security and availability in RAG
Ensures privacy in federated learning for RAG

Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated learning for privacy-preserving RAG services
Knowledge distillation for server-client model communication
Homomorphic encryption to protect model parameters
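
To illustrate how homomorphic encryption can protect parameters during aggregation, the sketch below uses a toy Paillier cryptosystem, which is additively homomorphic. The choice of Paillier is an assumption, since the page does not name the scheme FedE4RAG uses. The primes are far too small to be secure, and the integer-quantized parameter values are hypothetical.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt a non-negative integer m < n under public key n."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (1 + m * n) % n2 * pow(r, n, n2) % n2

def decrypt(n, private_key, c):
    """Recover the plaintext integer from ciphertext c."""
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return (x - 1) // n * mu % n

def add_encrypted(n, ciphertexts):
    """Multiply ciphertexts, which adds the underlying plaintexts."""
    n2 = n * n
    total = 1  # 1 is a valid encryption of 0
    for c in ciphertexts:
        total = total * c % n2
    return total

# Each client encrypts one quantized parameter; the aggregator only sees ciphertexts.
n, private_key = keygen()
client_values = [120, 340, 95]
ciphertexts = [encrypt(n, v) for v in client_values]
encrypted_sum = add_encrypted(n, ciphertexts)
decrypted_sum = decrypt(n, private_key, encrypted_sum)
```

Here the aggregator combines the encrypted updates without learning any individual value, and only the key holder can recover the sum (555 in this example). Who holds the private key in FedE4RAG is not stated on this page.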

Qianren Mao
Zhongguancun Laboratory
Text mining, Text Generation, Knowledge Graph and Reasoning

Qili Zhang
School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Hanwen Hao
School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Zhentao Han
School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Runhua Xu
Beihang University | former RSM@IBM Research
privacy-enhancing tech., security/privacy in AI/ML, applied crypto, blockchain

Weifeng Jiang
Nanyang Technological University (NTU), 639798, Singapore

Qi Hu
University of Maryland, College Park
fast multipole methods, scientific computing, GPGPU, HPC

Zhijun Chen
Beihang University
Machine Learning, Natural Language Processing

Tyler Zhou
Beijing Academy of Blockchain and Edge Computing (BAEC), Beijing, China

Bo Li
School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Zhongguancun Laboratory

Yangqiu Song
HKUST
Artificial Intelligence, Data Mining, Natural Language Processing, Knowledge Graphs, Commonsense Reasoning

Jin Dong
Beijing Academy of Blockchain and Edge Computing; Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beijing, China

Jianxin Li
School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Zhongguancun Laboratory

Philip S. Yu
Professor of Computer Science, University of Illinois at Chicago
Data mining, Database, Privacy