Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

📅 2026-03-27
🤖 AI Summary
This study addresses the scarcity of benchmarks and the severe class imbalance in named entity recognition (NER) for Portuguese clinical texts. It evaluates BERT-based models (BioBERTpt, BERTimbau, ModernBERT, and mmBERT) alongside large language models such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. To mitigate multilabel imbalance, the authors explore iterative stratification, weighted loss functions, and oversampling. mmBERT-base achieves the best performance, with a micro F1-score of 0.76, and can run locally with limited computational resources. Iterative stratification improves class balance and overall performance, and the results provide a benchmark for Portuguese clinical NER.
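As an illustration of the weighted-loss idea mentioned above, the following minimal sketch derives per-class weights from label frequencies. The inverse-frequency formula, the function name `inverse_frequency_weights`, and the toy labels are assumptions chosen for illustration; the summary does not state which weighting scheme the authors used.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: total / (num_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    This is one common heuristic, assumed here for illustration only.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Hypothetical token-level labels, heavily skewed toward the "O" (outside) tag.
labels = ["O"] * 8 + ["Disorder"] + ["Procedure"]
weights = inverse_frequency_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Weights like these are typically passed to a loss function so that errors on rare entity types cost more than errors on the dominant class.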
📝 Abstract
Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.
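The abstract reports a micro-averaged F1-score. The sketch below shows how micro averaging pools true positives, false positives, and false negatives across entity types before computing F1, and contrasts it with macro averaging. The per-class counts and label names are hypothetical and are not taken from the paper.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both 0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_f1(counts):
    """Pool counts over all classes, then compute a single F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


def macro_f1(counts):
    """Compute F1 per class, then take the unweighted mean."""
    scores = [f1_from_counts(*c) for c in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts: one frequent and one rare entity type.
counts = {"Disorder": (90, 10, 10), "Procedure": (1, 4, 4)}
micro = micro_f1(counts)
macro = macro_f1(counts)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Because micro averaging is dominated by frequent classes, it can stay high even when a rare class is recognized poorly, which is one reason imbalance-aware splitting and per-class reporting matter for this task.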
Problem

Research questions and friction points this paper is trying to address.

Clinical Named Entity Recognition
Portuguese language
class imbalance
benchmark
multilabel NER
Innovation

Methods, ideas, or system contributions that make the work stand out.

clinical NER
Portuguese language
mmBERT
iterative stratification
class imbalance
Authors
Vinicius Anjos de Almeida
Spesia, Curitiba - PR, Brazil; Faculdade de Medicina, Universidade de São Paulo, São Paulo - SP, Brazil
Sandro Saorin da Silva
Spesia, Curitiba - PR, Brazil
Josimar Chire
Spesia, Curitiba - PR, Brazil
Leonardo Vicenzi
Spesia, Curitiba - PR, Brazil
Nícolas Henrique Borges
Spesia, Curitiba - PR, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba - PR, Brazil
Helena Kociolek
Spesia, Curitiba - PR, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba - PR, Brazil; Universidade Federal do Paraná (UFPR), Curitiba - PR, Brazil
Sarah Miriã de Castro Rocha
Spesia, Curitiba - PR, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba - PR, Brazil; Universidade Federal do Paraná (UFPR), Curitiba - PR, Brazil
Frederico Nassif Gomes
Spesia, Curitiba - PR, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba - PR, Brazil
Júlia Cristina Ferreira
Spesia, Curitiba - PR, Brazil; Pontifícia Universidade Católica do Paraná (PUCPR), Curitiba - PR, Brazil
Oge Marques
Affiliate Professor of Computer Science and Engineering, Florida Atlantic University
Artificial Intelligence, Medical Image Processing and Analysis, Deep Learning, Machine Learning
Lucas Emanuel Silva e Oliveira
Professor at PUCPR & Head of AI and Strategy at SpesIA
Natural Language Processing, Machine Learning, Health Informatics, Large Language Models