A Causal Language Modeling Detour Improves Encoder Continued Pretraining

📅 2026-05-12

🤖 AI Summary
This work proposes a “CLM detour” strategy to improve the downstream performance of pretrained encoders adapted to new domains such as biomedicine. The approach first continues pretraining with causal language modeling (CLM), then applies a brief masked language modeling (MLM) decay phase. Analyses indicate that CLM's dense supervision changes the lower Transformer layers (0-7) far more than MLM does, and that these representational changes persist through the subsequent MLM phase and scale with model capacity. On 8 French and 11 English biomedical tasks, the method yields average gains of 1.2–2.8 and 0.3–0.8 percentage points, respectively, over MLM baselines trained on identical data and compute. The study also releases ModernCamemBERT-bio and ModernBERT-bio, presented as state-of-the-art biomedical encoders for their respective languages.
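The two-phase recipe can be pictured as a per-step plan: a block of CLM steps followed by a shorter block of MLM steps during which the learning rate decays. The sketch below is illustrative only and is not the authors' code. The function names, the step counts, and the linear decay shape are all assumptions; the source does not state the decay shape or the phase lengths.

```python
def detour_schedule(clm_steps, mlm_decay_steps):
    """Return the training objective used at each step (hypothetical helper).

    The first `clm_steps` steps use causal language modeling; the remaining
    `mlm_decay_steps` steps switch back to masked language modeling.
    """
    return ["clm"] * clm_steps + ["mlm"] * mlm_decay_steps


def decay_lr(step_in_decay, decay_steps, peak_lr):
    """Learning rate inside the MLM decay phase.

    Assumption: a linear decay from peak_lr down to 0. The source only says
    "MLM decay" and does not specify the shape.
    """
    return peak_lr * (1 - step_in_decay / decay_steps)


# Illustrative step counts, not taken from the paper.
schedule = detour_schedule(clm_steps=8, mlm_decay_steps=2)
```

Here `schedule` lists `"clm"` for the detour steps and `"mlm"` for the final decay steps, and `decay_lr` would be queried only during the latter.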
📝 Abstract
When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
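To make "dense supervision" concrete, the following sketch counts how many positions in one sequence receive a training signal under each objective. It is a simplified illustration, not the paper's training code: the 30% masking rate, the 512-token sequence, the `IGNORE` label value, and all function names are assumptions chosen for the example.

```python
import random

# Assumption: -100 marks positions that contribute no loss, a common convention.
IGNORE = -100


def mlm_labels(token_ids, mask_rate, rng):
    """MLM labels: only randomly selected (masked) positions are supervised."""
    return [tok if rng.random() < mask_rate else IGNORE for tok in token_ids]


def clm_labels(token_ids):
    """CLM labels: position i is trained to predict token i + 1.

    The final position has no next token, so it carries no loss.
    """
    return token_ids[1:] + [IGNORE]


def supervised_fraction(labels):
    """Fraction of positions that contribute to the loss."""
    return sum(label != IGNORE for label in labels) / len(labels)


rng = random.Random(0)
tokens = list(range(1000, 1512))  # 512 placeholder token ids

mlm_fraction = supervised_fraction(mlm_labels(tokens, 0.3, rng))  # assumed 30% masking
clm_fraction = supervised_fraction(clm_labels(tokens))
```

Under these assumptions, CLM supervises every position except the last (511 of 512), while MLM supervises only roughly the masked share of positions, which is the sense in which the CLM signal is denser.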
Problem

Research questions and friction points this paper is trying to address.

Encoder Adaptation
Continued Pretraining
Biomedical Language Modeling
Downstream Performance
Domain Adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Language Modeling
Continued Pretraining
Encoder Adaptation
Dense Supervision
Biomedical Language Models