FedRW: Efficient Privacy-Preserving Data Reweighting for Enhancing Federated Learning of Language Models

📅 2025-11-10

🤖 AI Summary
To address performance degradation and privacy leakage caused by data duplication in federated learning, this paper proposes the first privacy-preserving soft deduplication framework that requires no trusted third party. Methodologically, it introduces a frequency-aware reweighting protocol built on secure multi-party computation (MPC), integrating adaptive loss reweighting with a parallelized collaborative training architecture to enable global duplication estimation and sample-specific weighting without uploading raw local samples. Key contributions include: (i) the first application of soft deduplication to federated language modeling, balancing generalization and privacy; (ii) decentralized frequency statistics via MPC, eliminating the risks associated with centralized data cleaning; and (iii) a parallel scheduling mechanism that significantly improves scalability. Experiments demonstrate a 28.78× speedup in preprocessing over baseline methods, an 11.42% reduction in perplexity, and compliance with rigorous security guarantees.
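The decentralized frequency statistics can be illustrated with additive secret sharing, a standard MPC building block. This is a minimal sketch under the assumption that each client contributes local duplicate counts as additive shares, so only the global totals are ever reconstructed; the paper's actual protocol may differ (all function names here are illustrative).

```python
import random

PRIME = 2**61 - 1  # field modulus for additive secret sharing

def share(value, n_parties):
    """Split an integer count into n additive shares modulo PRIME."""
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]

def aggregate_frequencies(client_counts, n_parties):
    """Each client secret-shares its local counts per sample key; each
    party sums only the shares it holds, so no single party sees any
    client's individual counts. Reconstruction reveals global totals."""
    keys = sorted({k for counts in client_counts for k in counts})
    # party_sums[p][k] accumulates the p-th share from every client for key k
    party_sums = [dict.fromkeys(keys, 0) for _ in range(n_parties)]
    for counts in client_counts:
        for k in keys:
            for p, s in enumerate(share(counts.get(k, 0), n_parties)):
                party_sums[p][k] = (party_sums[p][k] + s) % PRIME
    # reconstruction: add the per-party partial sums modulo PRIME
    return {k: sum(ps[k] for ps in party_sums) % PRIME for k in keys}

# three clients holding overlapping (hashed) samples
clients = [{"s1": 2, "s2": 1}, {"s1": 3}, {"s2": 4, "s3": 1}]
print(aggregate_frequencies(clients, n_parties=3))
# {'s1': 5, 's2': 5, 's3': 1}
```

In practice sample keys would be hashes or sketches of local samples rather than raw identifiers, so the aggregation step never touches raw text.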

📝 Abstract
Data duplication within large-scale corpora often impedes large language models' (LLMs) performance and privacy. In privacy-concerned federated learning scenarios, conventional deduplication methods typically rely on trusted third parties to perform uniform deletion, risking loss of informative samples while introducing privacy vulnerabilities. To address these gaps, we propose Federated ReWeighting (FedRW), the first privacy-preserving framework, to the best of our knowledge, that performs soft deduplication via sample reweighting instead of deletion in federated LLM training, without assuming a trusted third party. At its core, FedRW proposes a secure, frequency-aware reweighting protocol through secure multi-party computation, coupled with a parallel orchestration strategy to ensure efficiency and scalability. During training, FedRW utilizes an adaptive reweighting mechanism with global sample frequencies to adjust individual loss contributions, effectively improving generalization and robustness. Empirical results demonstrate that FedRW outperforms the state-of-the-art method by achieving up to 28.78x speedup in preprocessing and approximately 11.42% improvement in perplexity, while offering enhanced security guarantees. FedRW thus establishes a new paradigm for managing duplication in federated LLM training.
Problem

Research questions and friction points this paper is trying to address.

Enhancing federated learning efficiency without trusted third parties
Addressing data duplication while preserving privacy in LLM training
Improving model generalization via secure sample reweighting instead of deletion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Privacy-preserving soft deduplication via sample reweighting
Secure frequency-aware reweighting using multi-party computation
Adaptive reweighting mechanism with global sample frequencies
Authors

Pukang Ye, East China Normal University
Junwei Luo, Wuhan University
Xiaolei Dong, East China Normal University
Yunbo Yang, East China Normal University