🤖 AI Summary
This work addresses the long-standing incompatibility between causal language modeling (CLM), as embodied by GPT, and masked language modeling (MLM), as used in BERT, two paradigms that traditionally require distinct architectures and training objectives. The authors propose a unified pretraining framework that jointly optimizes the MLM and CLM objectives within a standard Transformer, so that a single model supports both generative and discriminative use and can switch between the two modes without architectural modifications or task-specific adaptations. A hybrid loss function balances the two objectives, and the model is pretrained on the BabyLM 2024 corpus. Evaluation on the BabyLM Challenge 2024 shows substantial gains over pure MLM and pure CLM baselines, which the authors present as the first instance of native single-model support for both language modeling paradigms. All models, datasets, and code are publicly released.
📝 Abstract
We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We evaluate the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only and causal-only models. We openly release the models, training corpora, and code.
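The hybrid objective described above can be sketched as a weighted mix of two cross-entropy terms: an MLM term over masked positions and a CLM term over shifted next-token targets. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the mixing weight `lam`, and the simplification that a single forward pass serves both objectives are ours (in practice the masked and causal losses are typically computed on differently prepared inputs, e.g. separate sub-batches).

```python
import numpy as np

def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood, skipping positions labeled ignore_index."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    mask = targets != ignore_index
    safe = np.where(mask, targets, 0)  # placeholder index for ignored slots
    nll = -np.take_along_axis(log_probs, safe[..., None], axis=-1)[..., 0]
    return float(nll[mask].mean())

def hybrid_lm_loss(logits, input_ids, mlm_labels, lam=0.5):
    """Illustrative hybrid objective: lam * MLM + (1 - lam) * CLM.

    logits:     (batch, seq_len, vocab) model outputs
    input_ids:  (batch, seq_len) token ids, used as next-token targets
    mlm_labels: (batch, seq_len) original tokens at masked positions,
                ignore_index (-100) everywhere else
    """
    # MLM term: recover the original tokens at the masked positions only.
    mlm = cross_entropy(logits, mlm_labels)
    # CLM term: predict token t+1 from the logits at position t.
    clm = cross_entropy(logits[:, :-1], input_ids[:, 1:])
    return lam * mlm + (1.0 - lam) * clm
```

Setting `lam` to 1.0 or 0.0 recovers a pure masked or pure causal objective, which is what makes the two paradigms interpolable within one training loop.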