FedPS: Federated data Preprocessing via aggregated Statistics

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the absence of standardized data preprocessing methods in federated learning that are simultaneously privacy-preserving and communication-efficient, particularly for handling missing values, inconsistent data formats, and heterogeneous feature scales. To this end, the authors propose FedPS, the first unified federated preprocessing framework supporting feature scaling, encoding, discretization, and missing-value imputation, compatible with both horizontal and vertical federated settings. FedPS leverages data-sketching techniques to summarize local statistics efficiently and integrates secure aggregation protocols, enabling privacy-preserving preprocessing with low communication overhead. The framework further extends preprocessing-related algorithms such as k-Means, kNN, and Bayesian linear regression to the federated paradigm, improving model performance and practicality on real-world heterogeneous datasets while keeping preprocessing consistent across participants.
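The aggregated-statistics idea behind horizontal federated feature scaling can be illustrated with a minimal sketch: each client shares only (count, sum, sum of squares) for a feature, and the server combines these into a global mean and standard deviation. This is an illustrative reconstruction, not FedPS's actual protocol; the secure-aggregation layer is omitted and all function names are hypothetical.

```python
import numpy as np

def local_stats(x):
    # Client side: summarize a local feature column as (count, sum, sum of squares).
    return len(x), float(np.sum(x)), float(np.sum(x ** 2))

def fit_standard_scaler(client_stats):
    # Server side: combine the per-client summaries into a global mean and std.
    # In a real deployment the sums would travel through secure aggregation.
    n = sum(s[0] for s in client_stats)
    total = sum(s[1] for s in client_stats)
    total_sq = sum(s[2] for s in client_stats)
    mean = total / n
    std = (total_sq / n - mean ** 2) ** 0.5
    return mean, std

# Three clients holding horizontal partitions of one feature
clients = [np.array([1.0, 2.0]), np.array([3.0]), np.array([4.0, 5.0])]
mean, std = fit_standard_scaler([local_stats(x) for x in clients])
```

Each client can then apply `(x - mean) / std` locally, so every participant scales with identical parameters, matching the statistics of the pooled data without any raw values leaving a client.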

📝 Abstract
Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.
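As a concrete example of how preprocessing-related models can run on aggregated statistics, the sketch below performs one Lloyd iteration of horizontal federated k-Means: each client assigns its local points to the nearest centroid and reports only per-cluster coordinate sums and counts, which the server merges into new centroids. This is a hedged illustration under assumed interfaces, not the paper's implementation, and the secure-aggregation step is omitted.

```python
import numpy as np

def local_kmeans_stats(X, centers):
    # Client side: assign each local point to its nearest centroid, then report
    # per-cluster coordinate sums and counts (no raw rows leave the client).
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k = centers.shape[0]
    sums = np.zeros_like(centers)
    counts = np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = X[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts

def server_update(client_stats, centers):
    # Server side: merge the aggregated statistics into new centroids,
    # leaving any empty cluster's centroid unchanged.
    sums = sum(s for s, _ in client_stats)
    counts = sum(c for _, c in client_stats)
    new_centers = centers.copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return new_centers

# Two clients holding horizontal partitions of the data
clients = [np.array([[0.0, 0.0], [0.0, 1.0]]),
           np.array([[10.0, 10.0], [10.0, 9.0]])]
centers = np.array([[1.0, 1.0], [9.0, 9.0]])
new_centers = server_update([local_kmeans_stats(X, centers) for X in clients], centers)
```

Because the per-cluster sums and counts are additive, the merged update is identical to a centralized Lloyd step on the pooled data, which is what makes this style of algorithm a natural fit for aggregation-based FL.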
Problem

Research questions and friction points this paper is trying to address.

Federated Learning
Data Preprocessing
Privacy Preservation
Communication Efficiency
Statistical Aggregation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated Learning
Data Preprocessing
Aggregated Statistics
Data Sketching
Privacy-Preserving