🤖 AI Summary
This study addresses non-IID data and system heterogeneity in real-world clinical deployments of federated learning. It proposes an embedding-based federated learning framework for collaboratively training an iron deficiency prediction model across two heterogeneous healthcare institutions, AUMC and NHSBT. A frozen hematology foundation model, DeepCBC, extracts representations locally at each site, so only a lightweight downstream classifier undergoes federated training. The framework combines a personalized aggregation strategy, FedMAP, with a healthcare-oriented runtime governance platform, FLA³, which supports policy-based authorization, study-scoped execution, and signed audit logging. The work is presented as the first real-world deployment of embedding-based federated learning in a heterogeneous clinical setting. FedMAP attains ROC-AUC scores of 0.9594 at AUMC and 0.8671 at NHSBT, for a macro ROC-AUC of 0.9133, improving on local-only training at both sites (0.9470 and 0.8558, respectively).
📝 Abstract
Recent reviews find that the vast majority of published healthcare federated learning (FL) studies never reach real-world deployment. We developed an embedding-based FL pipeline for iron deficiency prediction from routine full blood count (FBC) data and deployed it across real institutional environments at Amsterdam University Medical Centre (AUMC) and NHS Blood and Transplant (NHSBT), two clinical environments that differ markedly in iron deficiency prevalence, ferritin distribution, and subject populations. A frozen domain-specific haematology foundation model, DeepCBC, performs site-local representation extraction, restricting federated training to a compact downstream classifier and substantially reducing recurrent communication relative to full-encoder federation. The two clinical datasets are structurally not independent and identically distributed (non-IID), with heterogeneity arising from distinct population differences rather than sampling artefacts. Runtime governance is enforced by FLA$^3$, a healthcare-oriented FL platform providing study-scoped execution, policy-based authorisation, and signed audit logging. Standard sample-size-weighted aggregation (FedAvg) reduced the area under the receiver operating characteristic curve (ROC-AUC) at both sites relative to local-only training, as the global update was biased towards the larger AUMC distribution. FedMAP, a personalised aggregation method, raised ROC-AUC from 0.9470 to 0.9594 at AUMC and from 0.8558 to 0.8671 at NHSBT relative to local-only training, achieving the highest macro ROC-AUC of 0.9133 and the best macro balanced accuracy overall. These results support personalised aggregation in clinical federations where client sample size and task relevance diverge substantially.
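The abstract's point about sample-size-weighted aggregation can be illustrated with a minimal sketch. It is not the authors' implementation: the client sizes (9000 and 1000) and the two-element parameter vectors are hypothetical placeholders, chosen only to show how a larger client dominates a FedAvg-style average. The ROC-AUC values are the ones reported above, and the macro score is assumed to be the unweighted mean across the two sites.

```python
import numpy as np


def fedavg(client_params, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg-style)."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()          # each client's share of the total samples
    stacked = np.stack(client_params)     # shape: (n_clients, n_params)
    return coeffs @ stacked, coeffs


def macro_mean(site_scores):
    """Unweighted mean across sites, so each site counts equally."""
    return sum(site_scores) / len(site_scores)


# Hypothetical classifier parameters and sample counts (not from the paper).
params_large_site = np.array([1.0, 0.0])
params_small_site = np.array([0.0, 1.0])
global_params, coeffs = fedavg(
    [params_large_site, params_small_site],
    [9000, 1000],
)
# With a 9:1 size ratio the global update sits close to the larger client.
print("aggregation coefficients:", coeffs)
print("global parameters:", global_params)

# Reported per-site ROC-AUC values (AUMC, NHSBT).
fedmap_auc = [0.9594, 0.8671]
local_auc = [0.9470, 0.8558]
print("FedMAP macro ROC-AUC:", macro_mean(fedmap_auc))
print("local-only macro ROC-AUC:", macro_mean(local_auc))
```

Under these assumptions the macro average of the FedMAP scores agrees with the reported 0.9133 to four decimal places, and the same calculation puts local-only training at roughly 0.9014.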