🤖 AI Summary
This study investigates the feasibility and performance of federated learning for multi-label ICD-9/ICD-10 code classification on MIMIC-IV clinical notes in distributed healthcare settings. We propose a lightweight federated framework that freezes pre-trained text embeddings—aggregating representations from six publicly available models—and trains only small, local MLPs (three architectures), thereby ensuring privacy preservation and computational efficiency. Ablation studies reveal that embedding quality exerts a substantially greater impact on performance than model complexity. Without sharing raw data, our federated approach achieves near-centralized performance: +0.8% micro-F1 and −1.2% macro-F1 relative to centralized training. Model size is reduced by an order of magnitude compared to state-of-the-art methods, validating the efficacy and robustness of the “high-quality embeddings + lightweight local models” paradigm for medical NLP in federated learning. This work provides a scalable, regulatory-compliant pathway toward deployable clinical coding systems.
📝 Abstract
This study investigates the feasibility and performance of federated learning (FL) for multi-label ICD code classification using clinical notes from the MIMIC-IV dataset. Unlike previous approaches that rely on centralized training or fine-tuned large language models, we propose a lightweight and scalable pipeline combining frozen text embeddings with simple multilayer perceptron (MLP) classifiers. This design offers a privacy-preserving and deployment-efficient alternative for clinical NLP applications, particularly suited to distributed healthcare settings. We conducted extensive experiments across both centralized and federated configurations, testing six publicly available embedding models from the Massive Text Embedding Benchmark (MTEB) leaderboard and three MLP classifier architectures under two medical coding standards (ICD-9 and ICD-10). Additionally, ablation studies over ten random stratified splits assess performance stability. Results show that embedding quality substantially outweighs classifier complexity in determining predictive performance, and that federated learning can closely match centralized results in idealized conditions. While the models are orders of magnitude smaller than state-of-the-art architectures and achieve competitive micro- and macro-F1 scores, limitations remain, including the lack of end-to-end training and the simplified FL assumptions. Nevertheless, this work demonstrates a viable path toward scalable, privacy-conscious medical coding systems and offers a starting point for future research into federated, domain-adaptive clinical AI.
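The core pipeline described above — frozen embeddings feeding small local MLPs whose weights are averaged across clients — can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: synthetic random vectors stand in for the frozen text-encoder embeddings, a one-hidden-layer MLP with per-label sigmoid outputs models multi-label ICD coding, and aggregation is plain FedAvg with one local gradient step per round.

```python
# Hypothetical sketch: FedAvg over lightweight MLP heads on frozen embeddings.
# All dimensions, learning rate, and round counts are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN, N_CODES, N_CLIENTS = 32, 16, 8, 3

def init_params():
    return {
        "W1": rng.normal(0, 0.1, (EMB_DIM, HIDDEN)), "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0, 0.1, (HIDDEN, N_CODES)), "b2": np.zeros(N_CODES),
    }

def forward(p, X):
    h = np.maximum(0, X @ p["W1"] + p["b1"])       # ReLU hidden layer
    logits = h @ p["W2"] + p["b2"]
    return h, 1.0 / (1.0 + np.exp(-logits))        # sigmoid per ICD label

def local_step(p, X, Y, lr=0.1):
    # One gradient step on multi-label binary cross-entropy; raw notes and
    # labels never leave the client, only the updated MLP weights do.
    h, probs = forward(p, X)
    err = (probs - Y) / len(X)
    dh = (err @ p["W2"].T) * (h > 0)
    return {
        "W1": p["W1"] - lr * (X.T @ dh), "b1": p["b1"] - lr * dh.sum(0),
        "W2": p["W2"] - lr * (h.T @ err), "b2": p["b2"] - lr * err.sum(0),
    }

def fedavg(params_list, sizes):
    # Weighted average of client parameters by local dataset size.
    total = sum(sizes)
    return {k: sum(n / total * p[k] for p, n in zip(params_list, sizes))
            for k in params_list[0]}

def bce(p, X, Y):
    _, probs = forward(p, X)
    eps = 1e-7
    return float(-np.mean(Y * np.log(probs + eps)
                          + (1 - Y) * np.log(1 - probs + eps)))

# Each client holds frozen embeddings X (kept local) and multi-hot labels Y.
clients = [(rng.normal(size=(50, EMB_DIM)),
            (rng.random((50, N_CODES)) < 0.2).astype(float))
           for _ in range(N_CLIENTS)]
X_all = np.vstack([X for X, _ in clients])
Y_all = np.vstack([Y for _, Y in clients])

global_params = init_params()
loss_before = bce(global_params, X_all, Y_all)
for _ in range(20):                                 # communication rounds
    updates = [local_step(global_params, X, Y) for X, Y in clients]
    global_params = fedavg(updates, [len(X) for X, _ in clients])
loss_after = bce(global_params, X_all, Y_all)
```

In a real deployment the local step would run for several epochs per round and only the MLP head (a few thousand parameters here) is communicated, which is what keeps the approach an order of magnitude smaller than fine-tuning a full encoder.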