Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality

📅 2025-10-31

🤖 AI Summary
To address the scarcity of medical NLP tools for low-resource languages, this work studies domain-adaptive further pre-training of multilingual BERT for three languages: Dutch, Romanian, and Spanish. Starting from multilingual BERT, the authors continue pre-training on domain-specific corpora covering both the clinical and the general biomedical domain, then fine-tune on downstream tasks. The central question is how domain granularity (clinical versus general biomedical) affects medical text understanding across languages. Experiments show that domain adaptation improves performance on automated patient screening and on named entity recognition in clinical notes. The clinical domain-adapted model outperforms the more general biomedical one, and evidence of cross-lingual transfer is observed. The study offers guidance for building multilingual medical NLP systems in low-resource language settings.

📝 Abstract
In multilingual healthcare applications, the availability of domain-specific natural language processing (NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising way to mitigate the language gap, medical NLP tasks in low-resource languages remain underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian, and Spanish. For further pre-training, we conducted four experiments to create medical domain models. These models were then fine-tuned on three downstream tasks: automated patient screening in Dutch clinical notes, and named entity recognition in Romanian and in Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, distinguishing between domains, e.g. clinical and general biomedical, led to differing performance: the clinical domain-adapted model outperformed the more general biomedical domain-adapted model. We also observed evidence of cross-lingual transferability, and conducted further investigations to explore potential reasons for these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual transfer in medical NLP. In low-resource language settings, they can provide meaningful guidance for developing multilingual medical NLP systems that mitigate the lack of training data and thereby improve model performance.
Problem

Research questions and friction points this paper is trying to address.

Evaluating domain adaptation for medical NLP tasks
Assessing cross-lingual transfer in low-resource languages
Improving model performance through specialized pre-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain adaptation via further pre-training on medical corpora
Cross-lingual transferability for low-resource language tasks
Fine-tuning on clinical notes for downstream medical NLP
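The listing does not say which objective the further pre-training step uses. As an illustration only, the sketch below assumes the standard BERT masked-language-modeling recipe: about 15% of tokens are selected for prediction, and of those 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged. The function name `mask_for_mlm`, the toy vocabulary, and the example sentence are hypothetical and not taken from the paper. Special tokens such as `[CLS]` and `[SEP]`, which a real pipeline would exclude from masking, are not handled here.

```python
import random

MASK = "[MASK]"


def mask_for_mlm(tokens, vocab, mask_prob=0.15, seed=None):
    """Build one masked-language-modeling example (illustrative sketch).

    Returns (inputs, labels). labels[i] holds the original token at the
    positions selected for prediction and None everywhere else, so a loss
    would only be computed on the selected positions.
    """
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK  # 80% of selected: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


if __name__ == "__main__":
    # Hypothetical Dutch-like sentence standing in for a domain corpus line.
    sentence = "de patient heeft koorts en hoest sinds drie dagen".split()
    toy_vocab = sorted(set(sentence))
    masked, targets = mask_for_mlm(sentence, toy_vocab, mask_prob=0.3, seed=0)
    print(masked)
    print(targets)
```

In a further pre-training run, examples like these would be drawn from the domain corpus (clinical or general biomedical) and used to continue training the multilingual checkpoint before any task-specific fine-tuning.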
Yinghao Luo
Department of Pathology & Clinical Bioinformatics, Erasmus University Medical Center Rotterdam
Lang Zhou
Department of Pathology & Clinical Bioinformatics, Erasmus University Medical Center Rotterdam
Amrish Jhingoer
Department of Pathology & Clinical Bioinformatics, Erasmus University Medical Center Rotterdam
Klaske Vliegenthart Jongbloed
Department of Internal Medicine, Erasmus University Medical Center Rotterdam
Carlijn Jordans
Department of Medical Microbiology & Infectious Diseases, Erasmus University Medical Center Rotterdam
Ben Werkhoven
Department of Data & Analytics, Erasmus University Medical Center Rotterdam
Tom Seinen
Department of Medical Informatics, Erasmus University Medical Center Rotterdam
Erik van Mulligen
Erasmus University Rotterdam
Text mining, knowledge discovery, ontologies, natural language processing
Casper Rokx
Department of Internal Medicine, Erasmus University Medical Center Rotterdam
Yunlei Li
Department of Pathology & Clinical Bioinformatics, Erasmus University Medical Center Rotterdam