Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

📅 2025-08-03
📈 Citations: 0
✹ Influential: 0
📄 PDF
đŸ€– AI Summary
To address catastrophic forgetting (performance degradation on previously learned tasks caused by the distribution shift that new data induces during continual pre-training of large language models), this paper proposes an efficient continual pre-training method that integrates experience replay with gradient alignment. The authors are the first to empirically validate the effectiveness of gradient alignment in LLM pre-training, and they introduce a low-overhead implementation of Meta Experience Replay (MER) that minimizes computational cost. Evaluated across multilingual and multitask settings on datasets comprising tens of billions of tokens, the approach significantly mitigates forgetting: replaying only 1% of historical data achieves performance gains comparable to those from scaling model parameters, while preserving training stability and computational efficiency. This work establishes a scalable, practical paradigm for sustainable LLM pre-training.
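As a concrete illustration of the replay mechanism the summary describes, here is a minimal sketch (not the authors' code) of how a small replay rate can be applied when assembling training batches. The names `new_stream` and `replay_buffer` and the 1% default `replay_rate` are illustrative assumptions based on the summary:

```python
import random

def mixed_batch(new_stream, replay_buffer, batch_size=512, replay_rate=0.01):
    """Mix a small fraction of previously seen examples into a batch of
    new-distribution data (a hypothetical sketch of experience replay).

    Assumes `new_stream` is an iterator over new-data examples and
    `replay_buffer` is a list holding at least a few old examples.
    """
    n_replay = max(1, int(batch_size * replay_rate))  # e.g. ~1% of the batch
    replayed = random.sample(replay_buffer, n_replay)  # draw old examples
    fresh = [next(new_stream) for _ in range(batch_size - n_replay)]
    batch = fresh + replayed
    random.shuffle(batch)  # interleave old and new examples within the batch
    return batch
```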

📝 Abstract
Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retrained from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages, with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment at negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are a more valuable use of compute than investing in model size, but that it is more compute-efficient to scale the size of the model than to invest in high rates of replaying old examples.
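The gradient alignment objective referenced in the abstract can be stated compactly. In the continual learning literature that MER comes from (Riemer et al., 2019), transfer and interference between two examples $x_i$ and $x_j$ are measured by the dot product of their loss gradients, and the meta-objective rewards alignment; a standard formulation, with $\alpha$ weighting the alignment term, is

$$\min_\theta \; \mathbb{E}_{(x_i,\, x_j)} \left[ L(x_i, \theta) + L(x_j, \theta) - \alpha \, \nabla_\theta L(x_i, \theta) \cdot \nabla_\theta L(x_j, \theta) \right]$$

A positive dot product means a gradient step on one example also lowers the loss on the other (transfer); a negative one means the examples interfere, which is the mechanism behind catastrophic forgetting.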
Problem

Research questions and friction points this paper is trying to address.

Addressing distribution shift in continual LLM pre-training
Evaluating replay and gradient alignment for stable learning without forgetting
Optimizing the compute trade-off between scaling model size and replaying old data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Applies experience replay and gradient alignment to large-scale LLM pre-training
Proposes an efficient meta-experience replay (MER) implementation (see the sketch after this list)
Shows that scaling model size is more compute-efficient than high replay rates
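The efficient MER implementation is the paper's headline contribution; below is a minimal sketch of the classic Reptile-style MER update it builds on (Riemer et al., 2019), not the paper's actual code. The names `model`, `loss_fn`, `inner_lr`, and `gamma` are illustrative assumptions:

```python
import copy
import torch

def mer_update(model, loss_fn, batch, inner_lr=1e-4, gamma=0.3):
    """One Reptile-style meta-experience-replay step (a sketch).

    Runs sequential SGD over the examples in `batch` (a mix of new and
    replayed data), then interpolates the weights back toward their
    starting point. The interpolation implicitly rewards update
    directions on which the per-example gradients agree, which is how
    MER obtains gradient alignment without second-order derivatives.
    """
    start = copy.deepcopy(model.state_dict())  # theta_0: weights before the inner loop

    for inputs, targets in batch:  # inner loop: one SGD step per example
        loss = loss_fn(model(inputs), targets)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= inner_lr * p.grad

    # Reptile meta-step: theta <- theta_0 + gamma * (theta_k - theta_0)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(start[name] + gamma * (p - start[name]))
```

Because the meta-step is just a weight interpolation, the extra cost over plain replay is one cached copy of the weights, consistent with the negligible overhead the paper reports.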
Istabrak Abbes
UniversitĂ© de MontrĂ©al, Mila – Quebec AI Institute, Chandar Research Lab
Gopeshh Subbaraj
UniversitĂ© de MontrĂ©al, Mila – Quebec AI Institute
Matthew Riemer
IBM, Mila
Artificial Intelligence · Deep Learning · Machine Learning
Nizar Islah
Université de Montréal, Mila
Continual Learning
Benjamin Therien
UniversitĂ© de MontrĂ©al, Mila – Quebec AI Institute
Tsuguchika Tabaru
Fujitsu Research
Hiroaki Kingetsu
Fujitsu Research
Sarath Chandar
Associate Professor @ Polytechnique Montreal, Mila, Canada CIFAR AI Chair, Canada Research Chair
Artificial Intelligence · Machine Learning · Deep Learning · Reinforcement Learning · NLP
Irina Rish
University of Montreal / Mila – Quebec AI Institute
Artificial Intelligence · Machine Learning · Neuroscience