🤖 AI Summary
This work addresses the long-standing incompatibility between causal language modeling (CLM), as embodied by GPT, and masked language modeling (MLM), as used in BERT, two paradigms that traditionally require distinct architectures and training objectives. The authors propose a unified pretraining framework that jointly optimizes the MLM and CLM objectives within a standard Transformer, so that a single model supports both generative and discriminative use and can switch between the two modes without architectural modifications or task-specific adaptations. A hybrid loss function balances the two objectives, and the model is pretrained on the BabyLM 2024 corpus. Evaluation on the BabyLM Challenge 2024 shows substantial gains over pure MLM and pure CLM baselines, which the authors present as the first instance of native single-model support for both language modeling paradigms. All models, datasets, and code are publicly released.
📝 Abstract
We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We evaluate the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only and causal-only models. We openly release the models, training corpora, and code.
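The hybrid objective described above can be sketched as a weighted mix of two cross-entropy terms: an MLM term over masked positions and a CLM term over shifted next-token targets. This is a minimal illustration under our own assumptions, not the paper's implementation: the function names, the mixing weight `lam`, and the simplification that a single forward pass serves both objectives are ours (in practice the masked and causal losses are typically computed on differently prepared inputs, e.g. separate sub-batches).

```python
import numpy as np

def cross_entropy(logits, targets, ignore_index=-100):
    """Mean negative log-likelihood, skipping positions labeled ignore_index."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    mask = targets != ignore_index
    safe = np.where(mask, targets, 0)  # placeholder index for ignored slots
    nll = -np.take_along_axis(log_probs, safe[..., None], axis=-1)[..., 0]
    return float(nll[mask].mean())

def hybrid_lm_loss(logits, input_ids, mlm_labels, lam=0.5):
    """Illustrative hybrid objective: lam * MLM + (1 - lam) * CLM.

    logits:     (batch, seq_len, vocab) model outputs
    input_ids:  (batch, seq_len) token ids, used as next-token targets
    mlm_labels: (batch, seq_len) original tokens at masked positions,
                ignore_index (-100) everywhere else
    """
    # MLM term: recover the original tokens at the masked positions only.
    mlm = cross_entropy(logits, mlm_labels)
    # CLM term: predict token t+1 from the logits at position t.
    clm = cross_entropy(logits[:, :-1], input_ids[:, 1:])
    return lam * mlm + (1.0 - lam) * clm
```

Setting `lam` to 1.0 or 0.0 recovers a pure masked or pure causal objective, which is what makes the two paradigms interpolable within one training loop.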