Should We Still Pretrain Encoders with Masked Language Modeling?

📅 2025-07-01

🤖 AI Summary

This work compares masked language modeling (MLM) and causal language modeling (CLM) as pretraining objectives for text representation learning. Method: We conduct a large-scale, controlled empirical study of 30 models spanning 210M to 1B parameters, with over 15,000 fine-tuning and evaluation runs. We also propose a two-stage “CLM→MLM” training strategy and assess starting from existing pretrained CLM models before switching to the MLM objective. Contribution/Results: MLM generally yields better downstream performance on text representation tasks, while CLM is more data-efficient and gives more stable fine-tuning. The two-stage strategy performs best under a fixed compute budget, and initializing from an existing pretrained CLM model reduces the compute needed to train a strong encoder. Because model and data scale are controlled across the ablations, the observed differences can be attributed to the objectives themselves rather than to confounding factors. The two objectives turn out to be complementary rather than mutually exclusive.

📝 Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 30 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models (from the existing LLM ecosystem), reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
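
The biphasic strategy described in the abstract can be pictured as a fixed step budget split into a CLM phase followed by an MLM phase. The sketch below is illustrative only: the function names, the `clm_fraction` parameter, and the 25% split in the usage note are assumptions made for this example and are not taken from the paper.

```python
def phase_lengths(total_steps: int, clm_fraction: float) -> tuple[int, int]:
    """Split a fixed step budget into (CLM steps, MLM steps).

    `clm_fraction` is a hypothetical knob; the source does not state a split.
    """
    if not 0.0 <= clm_fraction <= 1.0:
        raise ValueError("clm_fraction must be in [0, 1]")
    clm_steps = int(total_steps * clm_fraction)
    return clm_steps, total_steps - clm_steps


def objective_for_step(step: int, total_steps: int, clm_fraction: float) -> str:
    """Return the objective used at `step`: CLM first, then MLM."""
    if not 0 <= step < total_steps:
        raise ValueError("step out of range")
    clm_steps, _ = phase_lengths(total_steps, clm_fraction)
    return "clm" if step < clm_steps else "mlm"
```

For example, `phase_lengths(1000, 0.25)` assigns the first 250 steps to CLM and the remaining 750 to MLM. Starting from an already pretrained CLM model corresponds to skipping the first phase and spending the budget on the MLM phase.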

Problem

Research questions and friction points this paper is trying to address.

Comparing MLM and CLM for encoder pretraining effectiveness

Assessing CLM's data-efficiency and fine-tuning stability advantages

Proposing biphasic CLM-MLM training for optimal encoder performance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Compares MLM and CLM for encoder pretraining

Proposes biphasic CLM then MLM training strategy

Leverages existing pretrained CLM models for efficiency
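
To make the two objectives being compared concrete, here is a minimal sketch of how training targets differ between CLM and MLM on a toy token list. It is a simplification under stated assumptions: the `[MASK]` string, the `mask_prob` parameter, and the use of `None` for positions excluded from the loss are choices made for this example, and the masking rule shown is not necessarily the one used in the paper.

```python
import random

MASK = "[MASK]"   # placeholder mask token (assumed for illustration)
IGNORE = None     # marks positions that would not contribute to the loss


def clm_example(tokens):
    """CLM: predict each next token from the tokens before it."""
    return tokens[:-1], tokens[1:]


def mlm_example(tokens, mask_prob, rng):
    """MLM: hide some tokens and predict only the hidden ones.

    Simplified rule: every selected token is replaced by MASK.
    """
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(IGNORE)
    return inputs, targets


if __name__ == "__main__":
    toks = ["the", "cat", "sat", "down"]
    print(clm_example(toks))
    print(mlm_example(toks, 0.3, random.Random(0)))
```

In the CLM case every position has a target but the model only sees the left context, while in the MLM case only masked positions have targets and the model sees context on both sides.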
