🤖 AI Summary
This study addresses the challenge of continually pretraining large language models (LLMs) in dynamic knowledge environments, aiming to balance the assimilation of new knowledge with the retention of prior knowledge. To this end, we introduce the first benchmark specifically designed for evaluating continual pretraining under evolving data distributions, enabling systematic analysis of the interplay among model scale, the semantic structure of domain sequences, and knowledge transfer and forgetting. We propose a novel cross-domain adaptive evaluation paradigm and uncover three key findings: (i) smaller models (<1.5B parameters) exhibit high sensitivity to both learning and forgetting; (ii) semantically ordered domain sequences foster specialization, whereas random sequences enhance generalization and cross-domain transfer; and (iii) larger models consistently achieve lower perplexity than smaller ones when continually pretrained on the same corpus. Empirical results demonstrate that our continual pretraining paradigm significantly improves downstream task performance across the GPT-2 family, with particularly pronounced gains for smaller models.
📝 Abstract
Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect knowledge transfer within these models. Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B-parameter models studied in this work and also outperforms domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the highest rates of both learning and forgetting, (iv) continual pretraining boosts downstream task performance of the GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains exhibits semantic similarity, while randomizing the order of training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
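The learning and forgetting rates discussed above are typically derived from a matrix of per-domain perplexities recorded after each training stage. As a minimal sketch of that bookkeeping (this is an illustrative reconstruction, not the paper's actual evaluation code; the function name and the toy perplexity values are hypothetical), forgetting on a domain can be measured as the gap between the final perplexity and the best perplexity the model ever achieved on that domain:

```python
# Hypothetical sketch of the continual-pretraining evaluation protocol:
# after each training stage i, record perplexity ppl[i][j] on every
# domain j, then summarize forgetting per domain. The numbers below
# are illustrative, not results from the paper.

def forgetting_per_domain(ppl):
    """ppl[i][j]: perplexity on domain j after finishing stage i.
    Forgetting = final perplexity minus the best (lowest) perplexity
    ever reached on that domain (0.0 if the model never degraded)."""
    n_stages = len(ppl)
    n_domains = len(ppl[0])
    out = []
    for j in range(n_domains):
        best = min(ppl[i][j] for i in range(n_stages))
        out.append(ppl[-1][j] - best)
    return out

# Toy example: 3 domains trained sequentially; row i = eval after stage i.
ppl = [
    [20.0, 55.0, 60.0],  # after training on domain 0
    [24.0, 18.0, 52.0],  # then domain 1: domain 0 degrades slightly
    [30.0, 22.0, 15.0],  # then domain 2
]
print(forgetting_per_domain(ppl))  # [10.0, 4.0, 0.0]
```

Under this convention a positive value means the model forgot (perplexity rose after its best point), so smaller models showing "the highest rates of both learning and forgetting" would appear as both sharper drops along the diagonal and larger positive entries in this summary.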