Drop Dropout on Single-Epoch Language Model Pretraining

📅 2025-05-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the necessity of dropout in single-epoch large language model (LLM) pretraining. Motivated by the reduced overfitting risk of modern single-pass LLM pretraining, we systematically evaluate dropout's impact on both downstream performance and editability. Using BERT and Pythia (160M/1.4B) under masked language modeling and autoregressive pretraining, we assess language modeling, syntactic understanding (BLiMP), question answering (SQuAD), natural language inference (MNLI), and model editability (MEND/ReFT). Results show that dropout consistently degrades performance across all downstream tasks: neither standard nor "early" dropout improves on zero dropout. Removing dropout yields uniform gains in task accuracy and substantially increases MEND editing success, while ReFT performance remains comparable. This study provides the first empirical evidence that dropout is uniformly detrimental in single-epoch pretraining, showing that its regularization is redundant in this regime and that dropout-free models admit more effective gradient-based editing.

📝 Abstract
Originally, dropout was seen as a breakthrough regularization technique that reduced overfitting and improved performance in almost all applications of deep learning. Yet the single-epoch pretraining regime common to modern LLMs yields minimal overfitting, and dropout is accordingly no longer used for large LLMs. Nevertheless, no thorough empirical investigation of dropout's role in LM pretraining has been conducted. Through experiments in single-epoch pretraining of both masked (BERT) and autoregressive (Pythia 160M and 1.4B) LMs with varying levels of dropout, we find that downstream performance in language modeling, morpho-syntax (BLiMP), question answering (SQuAD), and natural-language inference (MNLI) improves when dropout is not applied during pretraining. We additionally find that the recently introduced "early dropout" also degrades performance relative to applying no dropout at all. We further investigate the models' editability, and find that models trained without dropout are more successful in gradient-based model editing (MEND) and equivalent in representation-based model editing (ReFT). Therefore, we advocate dropping dropout during single-epoch pretraining.
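For context, the mechanism under study is standard inverted dropout. A minimal pure-Python sketch (illustrative only, not the paper's training code) shows the train-time behavior, and why a dropout rate of 0.0 is simply the identity map, i.e. the dropout-free setting the paper recommends:

```python
import random

def inverted_dropout(x, p, rng=None):
    """Inverted dropout at train time: zero each unit with probability p
    and rescale survivors by 1/(1-p), keeping the expected activation
    unchanged so inference needs no correction.
    With p == 0.0 this is the identity (the "no dropout" setting)."""
    if p == 0.0:
        return list(x)
    rng = rng or random.Random(0)
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]
```

Because survivors are rescaled during training, test-time activations need no adjustment; setting `p = 0.0` exactly recovers the un-regularized network.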
Problem

Research questions and friction points this paper is trying to address.

Investigates dropout's role in single-epoch LM pretraining
Compares dropout impact on masked and autoregressive LMs
Assesses dropout effects on downstream tasks and model editability
Innovation

Methods, ideas, or system contributions that make the work stand out.

No dropout improves downstream performance
Early dropout degrades model performance
Dropout-free models enhance gradient-based editing
Houjun Liu
Stanford University
NLP · speech technology · architecture · POMDP
John Bauer
Stanford University
Christopher D. Manning
Stanford University