🤖 AI Summary
This study addresses three challenges in applying federated learning (FL) to cardiovascular disease risk prediction: stringent privacy requirements, high communication overhead, and severe class imbalance across institutions. It integrates non-parametric models, specifically Random Forest and XGBoost, into a medical FL framework alongside parametric models, and proposes three techniques: (1) tree-subset sampling, which cuts Random Forest communication overhead by 70%; (2) lightweight XGBoost-based feature extraction, which enables federated ensembles; and (3) federated SMOTE synchronization, which addresses cross-institutional class imbalance. Evaluated on the Framingham dataset (4,238 records), federated XGBoost achieves an F1-score of 0.80, surpassing its centralized counterpart (0.78), while federated Random Forest attains 0.81, matching non-federated performance. The communication-efficient strategies reduce bandwidth consumption by 3.2× while preserving 95% accuracy, and F1-scores are up to 15% higher than those of existing FL frameworks. The authors describe the work as the first practical integration of non-parametric models into federated healthcare systems.
📝 Abstract
Cardiovascular diseases (CVD) cause over 17 million deaths annually worldwide, highlighting the urgent need for privacy-preserving predictive systems. We introduce FedCVD++, an enhanced federated learning (FL) framework that integrates both parametric models (logistic regression, SVM, neural networks) and non-parametric models (Random Forest, XGBoost) for coronary heart disease risk prediction. To address key FL challenges, we propose: (1) tree-subset sampling that reduces Random Forest communication overhead by 70%, (2) XGBoost-based feature extraction enabling lightweight federated ensembles, and (3) federated SMOTE synchronization for resolving cross-institutional class imbalance.
Evaluated on the Framingham dataset (4,238 records), FedCVD++ achieves state-of-the-art results: federated XGBoost (F1 = 0.80) surpasses its centralized counterpart (F1 = 0.78), and federated Random Forest (F1 = 0.81) matches non-federated performance. Additionally, our communication-efficient strategies reduce bandwidth consumption by 3.2× while preserving 95% accuracy.
Compared to existing FL frameworks, FedCVD++ delivers up to 15% higher F1-scores and superior scalability for multi-institutional deployment. This work represents the first practical integration of non-parametric models into federated healthcare systems, providing a privacy-preserving solution validated under real-world clinical constraints.
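The abstract does not specify how tree-subset sampling is implemented, so the following is only a minimal sketch of the general idea, under stated assumptions: each client trains a local forest, transmits a random subset of its trees instead of the whole forest, and the server concatenates the received subsets into a global ensemble. The function names (`sample_tree_subset`, `merge_forests`, `communication_reduction`), the uniform random sampling, and the `keep_fraction` parameter are all hypothetical and are not taken from FedCVD++. Trees are treated as opaque objects, and communication cost is approximated by the number of trees sent.

```python
import random


def sample_tree_subset(trees, keep_fraction, seed=None):
    """Pick a uniform random subset of a client's trees to transmit.

    Hypothetical helper: the paper may select trees differently
    (for example by validation score); uniform sampling is an assumption.
    """
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Always send at least one tree so the client still contributes.
    k = max(1, round(len(trees) * keep_fraction))
    rng = random.Random(seed)
    return rng.sample(trees, k)


def merge_forests(client_subsets):
    """Server side: concatenate the subsets from all clients into one ensemble."""
    return [tree for subset in client_subsets for tree in subset]


def communication_reduction(total_trees, sent_trees):
    """Fraction of tree payload saved, assuming all trees are similar in size."""
    return 1.0 - sent_trees / total_trees
```

Under these assumptions, a client holding 100 trees that calls `sample_tree_subset(trees, 0.3)` sends 30 trees, and `communication_reduction(100, 30)` gives a 70% saving, which is the same order as the Random Forest figure quoted in the abstract. Whether the paper reaches its figure this way is not stated.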