Multilingual BERT language model for medical tasks: Evaluation on domain-specific adaptation and cross-linguality

📅 2025-10-31

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

To address the scarcity of medical NLP tools for low-resource languages (Dutch, Romanian, Spanish), this work proposes a dual-track domain-adaptive pretraining framework targeting both clinical and general biomedical domains. Building upon multilingual BERT, we perform incremental pretraining on domain-specific corpora, followed by task-specific fine-tuning. Our key contribution is the first systematic empirical investigation into how domain granularity—clinical versus general biomedical—affects cross-lingual medical text understanding. Experiments demonstrate that domain adaptation substantially improves performance on patient auto-screening and biomedical named entity recognition. Clinically specialized models consistently outperform general biomedical ones across all languages, and robust cross-lingual transfer is observed. This study establishes a reproducible, generalizable pretraining paradigm for medical NLP in low-resource languages, advancing domain-aware multilingual biomedical language modeling.

Technology Category

Application Category

📝 Abstract

In multilingual healthcare applications, the availability of domain-specific natural language processing(NLP) tools is limited, especially for low-resource languages. Although multilingual bidirectional encoder representations from transformers (BERT) offers a promising motivation to mitigate the language gap, the medical NLP tasks in low-resource languages are still underexplored. Therefore, this study investigates how further pre-training on domain-specific corpora affects model performance on medical tasks, focusing on three languages: Dutch, Romanian and Spanish. In terms of further pre-training, we conducted four experiments to create medical domain models. Then, these models were fine-tuned on three downstream tasks: Automated patient screening in Dutch clinical notes, named entity recognition in Romanian and Spanish clinical notes. Results show that domain adaptation significantly enhanced task performance. Furthermore, further differentiation of domains, e.g. clinical and general biomedical domains, resulted in diverse performances. The clinical domain-adapted model outperformed the more general biomedical domain-adapted model. Moreover, we observed evidence of cross-lingual transferability. Moreover, we also conducted further investigations to explore potential reasons contributing to these performance differences. These findings highlight the feasibility of domain adaptation and cross-lingual ability in medical NLP. Within the low-resource language settings, these findings can provide meaningful guidance for developing multilingual medical NLP systems to mitigate the lack of training data and thereby improve the model performance.

Problem

Research questions and friction points this paper is trying to address.

Evaluating domain adaptation for medical NLP tasks

Assessing cross-lingual transfer in low-resource languages

Improving model performance through specialized pre-training

Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain adaptation via further pre-training on medical corpora

Cross-lingual transferability for low-resource language tasks

Fine-tuning on clinical notes for downstream medical NLP

🔎 Similar Papers

No similar papers found.