MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation

📅 2026-02-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of maintaining strong performance on languages such as Catalan and Spanish while also meeting the demands of specialized domains like biomedicine and law. Building on the ModernBERT architecture, the authors develop multilingual encoders with 150M-300M parameters, pre-trained on 35 languages and code. They apply Matryoshka Representation Learning (MRL) to multilingual domain-specific encoders for the first time. Through adaptation at the vocabulary, domain, and dimensional levels, the models support flexible vector sizes, substantially reducing storage and inference costs without sacrificing linguistic fidelity. The resulting models achieve state-of-the-art results on Catalan and Spanish benchmarks and generalize well to high-stakes professional domains.

📝 Abstract
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code. Through targeted adaptation, this model family achieves state-of-the-art results on Catalan- and Spanish-specific tasks, while establishing robust performance across specialized biomedical and legal domains. To bridge the gap between research and production, we incorporate Matryoshka Representation Learning (MRL), enabling flexible vector sizing that significantly reduces inference and storage costs. Ultimately, the MrBERT family demonstrates that modern encoder architectures can be optimized for both localized linguistic excellence and efficient, high-stakes domain specialization. We open-source the complete model family on Hugging Face.
Problem

Research questions and friction points this paper is trying to address.

multilingual encoders
domain specialization
linguistic excellence
inference efficiency
storage efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Matryoshka Representation Learning
multilingual adaptation
domain specialization
efficient inference
vocabulary adaptation
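The Matryoshka Representation Learning listed above trains embeddings whose leading coordinates already form a usable sub-embedding, so vectors can be truncated at index or query time to cut storage and inference cost. Below is a minimal sketch of that truncation step, assuming an L2-normalized 768-dimensional embedding and illustrative nested sizes; the actual MrBERT dimensions and API are not specified here, and the random vector stands in for a real encoder output.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize.

    Matryoshka-trained encoders order information so that the
    leading dimensions carry a coarse but usable representation,
    which is what makes this simple slice meaningful.
    """
    sub = vec[:dim]
    return sub / np.linalg.norm(sub)

# Mock full-size embedding (a real one would come from the encoder).
rng = np.random.default_rng(0)
full = rng.normal(size=768)
full /= np.linalg.norm(full)

# Smaller prefixes trade accuracy for storage/compute.
for d in (64, 256, 768):
    small = truncate_embedding(full, d)
    print(d, small.shape)
```

In a retrieval setting, the same prefix length must be used for both documents and queries; cosine similarity on the truncated, re-normalized vectors then approximates similarity under the full embedding.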
Daniel Tamayo
Harvey Mudd College
Orbital Dynamics, Planetary Science, Chaos
Iñaki Lacunza
Barcelona Supercomputing Center
Paula Rivera-Hidalgo
Barcelona Supercomputing Center
Severino Da Dalt
Barcelona Supercomputing Center
Javier Aula-Blasco
Barcelona Supercomputing Center
Aitor Gonzalez-Agirre
Barcelona Supercomputing Center (BSC)
Artificial Intelligence, Natural Language Processing, Semantics, Deep Learning
Marta Villegas
Barcelona Supercomputing Center
Natural Language Processing