A Causal Language Modeling Detour Improves Encoder Continued Pretraining

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a "CLM detour" for adapting pretrained encoders to new domains such as biomedicine. Instead of continuing pretraining with masked language modeling (MLM) alone, the model is first trained with causal language modeling (CLM) and then returned to MLM for a short decay phase. The analysis attributes the gains to CLM's dense supervision, which changes the low Transformer layers (0-7) far more than MLM does; these representational changes persist through the MLM decay phase and scale with model capacity. Against MLM baselines trained on identical data and compute, the method improves average performance by 1.2–2.8 percentage points on 8 French biomedical tasks and by 0.3–0.8 percentage points on 11 English biomedical tasks, depending on model size. The study also releases ModernCamemBERT-bio and ModernBERT-bio in Base and Large sizes, presented as state-of-the-art biomedical encoders for their respective languages.

📝 Abstract

When adapting an encoder to a new domain, the standard approach is to continue training with Masked Language Modeling (MLM). We show that temporarily switching to Causal Language Modeling (CLM) followed by a short MLM decay improves downstream performance. On biomedical texts with ModernBERT, this CLM detour outperforms MLM baselines trained on identical data and compute across 8 French and 11 English biomedical tasks, by +1.2-2.8pp and +0.3-0.8pp respectively, depending on model size. We investigate the reasons for these gains. We find that CLM's dense supervision impacts low transformer layers (0-7) far more than MLM does. Freezing low layers during CLM eliminates the downstream benefit; freezing mid layers preserves it. The representational changes persist through the MLM decay phase, even when it matches the CLM phase in length, and they scale with model capacity. We release ModernCamemBERT-bio and ModernBERT-bio as state-of-the-art biomedical encoders in Base and Large sizes.
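To make "dense supervision" concrete, here is a minimal sketch of how training targets differ between the two objectives, plus a toy two-phase schedule. It is not the authors' code: the helper names (`clm_labels`, `mlm_labels`, `objective_at`), the 30% mask rate, and the step counts are assumptions chosen for illustration. The sketch only builds labels; a real CLM phase would also use a causal attention mask, and a real MLM phase would replace the selected input tokens with a mask token.

```python
import random

# Label value that the loss skips (PyTorch's CrossEntropyLoss uses -100 by default).
IGNORE = -100


def clm_labels(token_ids):
    """Next-token targets: position i predicts token i+1; the last position has no target."""
    return token_ids[1:] + [IGNORE]


def mlm_labels(token_ids, mask_rate, rng):
    """Masked-token targets: only a random subset of positions is supervised."""
    n_masked = max(1, round(mask_rate * len(token_ids)))
    masked = set(rng.sample(range(len(token_ids)), n_masked))
    return [tok if i in masked else IGNORE for i, tok in enumerate(token_ids)]


def supervised_fraction(labels):
    """Share of positions that contribute to the loss."""
    return sum(label != IGNORE for label in labels) / len(labels)


def objective_at(step, clm_steps, decay_steps):
    """Hypothetical schedule: a CLM phase followed by an MLM decay phase."""
    if step < clm_steps:
        return "clm"
    if step < clm_steps + decay_steps:
        return "mlm"
    raise ValueError("step is beyond the end of the schedule")


tokens = list(range(100, 120))  # 20 toy token ids
rng = random.Random(0)
clm = clm_labels(tokens)
mlm = mlm_labels(tokens, mask_rate=0.3, rng=rng)  # mask rate is an assumption

print("CLM supervised fraction:", supervised_fraction(clm))
print("MLM supervised fraction:", supervised_fraction(mlm))
print("objective at step 0:", objective_at(0, clm_steps=1000, decay_steps=100))
print("objective at step 1000:", objective_at(1000, clm_steps=1000, decay_steps=100))
```

With CLM, every position except the last receives a target, whereas MLM supervises only the masked subset, so each sequence yields several times more loss terms under CLM at the same compute.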

Problem

Research questions and friction points this paper is trying to address.

Encoder Adaptation

Continued Pretraining

Biomedical Language Modeling

Downstream Performance

Domain Adaptation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Language Modeling

Continued Pretraining

Encoder Adaptation