BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

๐Ÿ“… 2025-06-12
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the limited domain adaptation, short context windows, and poor cross-source generalization of encoder-only Transformers in biomedical and clinical NLP, this work introduces a high-performance long-context encoder designed specifically for the domain. Building on the ModernBERT architecture, the authors perform continued pretraining on the largest biomedical and clinical corpus to date: 53.5 billion tokens drawn from 20 institutionally and geographically diverse datasets, rather than a single source. The resulting model, BioClinical ModernBERT, achieves state-of-the-art performance on four clinical downstream tasks, substantially improving long-text modeling while preserving inference efficiency. Both base (150M parameters) and large (396M parameters) variants are released publicly, along with training checkpoints, to support reproducibility and further work in clinical language understanding.

๐Ÿ“ Abstract
Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.
Problem

Research questions and friction points this paper is trying to address.

Improving biomedical and clinical NLP encoder performance
Addressing limited domain adaptation in clinical encoders
Enhancing long-context processing for biomedical texts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Domain-adapted encoder for biomedical NLP
Long-context processing with improved speed
Pretrained on diverse 53.5B token corpus
๐Ÿ”Ž Similar Papers
No similar papers found.
Authors
Thomas Sounack (Dana-Farber Cancer Institute)
Joshua Davis (Dana-Farber Cancer Institute, Albany Medical College)
Brigitte N. Durieux (Dana-Farber Cancer Institute, McGill University)
Antoine Chaffin (LightOn)
Tom J. Pollard (Massachusetts Institute of Technology)
Eric Lehman (OpenEvidence)
Alistair E. W. Johnson (Microsoft)
Matthew McDermott (Harvard Medical School)
Tristan Naumann (Principal Researcher, Microsoft Research Health Futures)
C. Lindvall (Dana-Farber Cancer Institute, Harvard Medical School)