🤖 AI Summary
Federated learning lacks standardized preprocessing methods that preserve privacy while remaining communication-efficient, particularly for handling missing values, inconsistent data formats, and heterogeneous feature scales. This work proposes FedPS, a unified federated preprocessing framework that supports feature scaling, encoding, discretization, and missing-value imputation in both horizontal and vertical federated settings. FedPS uses data-sketching techniques to summarize local statistics efficiently and integrates secure aggregation protocols, enabling privacy-preserving preprocessing with low communication overhead. The framework also extends preprocessing-related models such as k-Means, kNN, and Bayesian linear regression to the federated setting, with the aim of keeping preprocessing consistent across participants on heterogeneous real-world datasets.
📝 Abstract
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
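To make the idea of preprocessing from aggregated statistics concrete, here is a minimal sketch of federated standardization with mean imputation in the horizontal setting. It is an illustration under stated assumptions, not the FedPS implementation: the names `LocalStats`, `local_stats`, `aggregate`, and `transform` are hypothetical, missing values are assumed to be encoded as `None`, and the data sketches and secure aggregation that the paper relies on are omitted. Each client shares only a count, a sum, and a sum of squares per feature, never raw rows.

```python
import math
from dataclasses import dataclass


@dataclass
class LocalStats:
    """Per-feature sufficient statistics from one client (hypothetical type)."""
    count: int
    total: float
    total_sq: float


def local_stats(column):
    """Client side: summarize one feature column, skipping missing values (None)."""
    observed = [x for x in column if x is not None]
    return LocalStats(
        count=len(observed),
        total=sum(observed),
        total_sq=sum(x * x for x in observed),
    )


def aggregate(stats):
    """Server side: combine client summaries into a global mean and std."""
    n = sum(s.count for s in stats)
    if n == 0:
        raise ValueError("no observed values across clients")
    mean = sum(s.total for s in stats) / n
    # Population variance from aggregated moments; clamp small negatives
    # that floating-point rounding can produce.
    var = max(sum(s.total_sq for s in stats) / n - mean * mean, 0.0)
    return mean, math.sqrt(var)


def transform(column, mean, std):
    """Client side: impute missing values with the global mean, then standardize."""
    scale = std if std > 0 else 1.0  # leave constant features unscaled
    return [((mean if x is None else x) - mean) / scale for x in column]


# Two clients hold different rows of the same feature (horizontal setting).
client_a = [1.0, 2.0, None]
client_b = [3.0, 6.0]

mean, std = aggregate([local_stats(client_a), local_stats(client_b)])
scaled_a = transform(client_a, mean, std)
scaled_b = transform(client_b, mean, std)
```

Because every client applies the same global `mean` and `std`, the transformed features are consistent across participants, which is the property the abstract emphasizes. The sum-of-squares formula is chosen here for brevity; it can lose precision when the mean is large relative to the spread, so a production system would likely prefer a numerically stabler merge of per-client means and variances.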