Federated Online Learning for Heterogeneous Multisource Streaming Data

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses federated learning for distributed, multi-source, high-dimensional streaming data under privacy constraints. Methodologically, it proposes a federated online learning framework that accounts for data heterogeneity, dynamics, and differential privacy requirements. It introduces a subgroup structure assumption to capture latent similarities across data sources, integrates penalized reproducing kernel Hilbert space (RKHS) estimation with an efficient proximal gradient algorithm, and enables online model updates using only historical summary statistics. The approach achieves statistically optimal, consistent estimation and exact subgroup structure recovery while satisfying differential privacy. Theoretical analysis guarantees convergence and consistent subgroup identification. Empirical evaluation on real-world streaming datasets, including financial lending and web log data, demonstrates significant improvements in prediction accuracy and personalization performance, alongside reduced communication and storage overhead.
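The "online updates using only historical summary statistics" idea can be illustrated with a minimal sketch. The class below is a hypothetical example (not the paper's actual estimator, which uses penalized RKHS estimation): a plain least-squares model that accumulates the Gram matrix X'X and cross term X'y per batch, so raw data can be discarded after each update.

```python
import numpy as np

class RenewableLinearModel:
    """Toy online least-squares estimator updated from summary statistics.

    Illustrative sketch only: it mirrors the storage principle of
    renewable estimation (keep X'X and X'y, discard raw batches),
    not the paper's penalized RKHS procedure.
    """

    def __init__(self, dim, ridge=1e-6):
        self.gram = np.zeros((dim, dim))  # accumulated X'X over all batches
        self.xty = np.zeros(dim)          # accumulated X'y over all batches
        self.ridge = ridge                # small ridge term for numerical stability

    def update(self, X, y):
        # Fold the new batch into the summaries; (X, y) need not be stored.
        self.gram += X.T @ X
        self.xty += X.T @ y

    def coef(self):
        # Solve (X'X + ridge * I) beta = X'y from the summaries alone.
        d = self.gram.shape[0]
        return np.linalg.solve(self.gram + self.ridge * np.eye(d), self.xty)
```

Because the summaries are additive across batches, the streaming estimate coincides with the full-data least-squares fit, which is the key storage-saving property the abstract describes.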

📝 Abstract
Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. Most existing federated learning methods focus on "static" datasets. However, in many real-world applications, data arrive continuously over time, forming streaming datasets. This introduces additional challenges for data storage and algorithm design, particularly in high-dimensional settings. In this paper, we propose a federated online learning (FOL) method for distributed multi-source streaming data analysis. To account for heterogeneity, a personalized model is constructed for each data source, and a novel "subgroup" assumption is employed to capture potential similarities, thereby enhancing model performance. We adopt the penalized renewable estimation method and efficient proximal gradient descent for model training. The proposed method aligns with both federated and online learning frameworks: raw data are not exchanged among sources, ensuring data privacy, and only summary statistics of previous data batches are required for model updates, significantly reducing storage demands. Theoretically, we establish consistency properties for model estimation, variable selection, and subgroup structure recovery, demonstrating optimal statistical efficiency. Simulations illustrate the effectiveness of the proposed method. Furthermore, when applied to financial lending data and web log data, the proposed method also exhibits advantageous prediction performance. Results of the analysis also provide practical insights.
Problem

Research questions and friction points this paper is trying to address.

Federated online learning for heterogeneous streaming data analysis
Personalized models for multi-source data with subgroup similarities
Privacy-preserving distributed learning with reduced storage requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated online learning for streaming data
Personalized models with subgroup similarity assumption
Penalized renewable estimation with proximal gradient descent
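The "penalized estimation with proximal gradient descent" ingredient can be sketched with a standard lasso example. This is illustrative only: the paper's objective combines renewable/RKHS terms and a subgroup-similarity penalty, whereas the function names and plain L1 penalty below are our assumptions, chosen to show the proximal update structure.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 penalty (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(X, y, lam, step=None, n_iter=500):
    """Toy proximal gradient descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n, d = X.shape
    if step is None:
        # Inverse Lipschitz constant of the smooth least-squares gradient.
        step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)                            # smooth-part gradient
        beta = soft_threshold(beta - step * grad, step * lam)  # proximal step
    return beta
```

Each iteration takes a gradient step on the smooth loss and then applies the penalty's proximal map, which is what makes proximal gradient methods efficient for nonsmooth penalized objectives like the one used here.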
Jingmao Li
Department of Biostatistics, Yale School of Public Health
Yuanxing Chen
Yau Mathematical Sciences Center, Tsinghua University
Shuangge Ma
Yale University
Genetic epidemiology · Survival analysis · Cancer · Health economics
Kuangnan Fang
Department of Statistics and Data Science, School of Economics, Xiamen University