🤖 AI Summary
This study addresses the challenge of continually pretraining large language models (LLMs) in dynamic knowledge environments, aiming to balance the assimilation of new knowledge with the retention of prior knowledge. To this end, we introduce the first benchmark specifically designed for evaluating continual pretraining under evolving data distributions, enabling systematic analysis of the interplay among model scale, the semantic structure of domain sequences, and knowledge transfer and forgetting. We propose a novel cross-domain adaptive evaluation paradigm and uncover three key findings: (i) smaller models (<1.5B parameters) exhibit high sensitivity to both learning and forgetting; (ii) semantically ordered domain sequences foster specialization, whereas random sequences enhance generalization and cross-domain transfer; and (iii) larger models consistently achieve lower perplexity than smaller ones when continually pretrained on the same corpus. Empirical results demonstrate that our continual pretraining paradigm significantly improves downstream task performance across the GPT-2 family, with particularly pronounced gains for smaller models.
📝 Abstract
Continual learning (CL) in large language models (LLMs) is an evolving domain that focuses on developing efficient and sustainable training strategies to adapt models to emerging knowledge and achieve robustness in dynamic environments. Our primary emphasis is on continual domain-adaptive pretraining, a process designed to equip LLMs with the ability to integrate new information from various domains while retaining previously learned knowledge. Since existing works concentrate mostly on continual fine-tuning for a limited selection of downstream tasks or training domains, we introduce a new benchmark designed to measure the adaptability of LLMs to changing pretraining data landscapes. We further examine the impact of model size on learning efficacy and forgetting, as well as how the progression and similarity of emerging domains affect knowledge transfer within these models. Our findings uncover several key insights: (i) continual pretraining consistently improves the <1.5B-parameter models studied in this work and also outperforms domain adaptation, (ii) larger models always achieve better perplexity than smaller ones when continually pretrained on the same corpus, (iii) smaller models are particularly sensitive to continual pretraining, showing the highest rates of both learning and forgetting, (iv) continual pretraining boosts downstream task performance of the GPT-2 family, (v) continual pretraining enables LLMs to specialize better when the sequence of domains exhibits semantic similarity, while randomizing the order of training domains leads to better transfer and final performance otherwise. We posit that our research establishes a new benchmark for CL in LLMs, providing a more realistic evaluation of knowledge retention and transfer across diverse domains.
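The learning and forgetting rates discussed above are typically derived from a matrix of per-domain perplexities recorded after each training stage. As a minimal sketch of that bookkeeping (this is an illustrative reconstruction, not the paper's actual evaluation code; the function name and the toy perplexity values are hypothetical), forgetting on a domain can be measured as the gap between the final perplexity and the best perplexity the model ever achieved on that domain:

```python
# Hypothetical sketch of the continual-pretraining evaluation protocol:
# after each training stage i, record perplexity ppl[i][j] on every
# domain j, then summarize forgetting per domain. The numbers below
# are illustrative, not results from the paper.

def forgetting_per_domain(ppl):
    """ppl[i][j]: perplexity on domain j after finishing stage i.
    Forgetting = final perplexity minus the best (lowest) perplexity
    ever reached on that domain (0.0 if the model never degraded)."""
    n_stages = len(ppl)
    n_domains = len(ppl[0])
    out = []
    for j in range(n_domains):
        best = min(ppl[i][j] for i in range(n_stages))
        out.append(ppl[-1][j] - best)
    return out

# Toy example: 3 domains trained sequentially; row i = eval after stage i.
ppl = [
    [20.0, 55.0, 60.0],  # after training on domain 0
    [24.0, 18.0, 52.0],  # then domain 1: domain 0 degrades slightly
    [30.0, 22.0, 15.0],  # then domain 2
]
print(forgetting_per_domain(ppl))  # [10.0, 4.0, 0.0]
```

Under this convention a positive value means the model forgot (perplexity rose after its best point), so smaller models showing "the highest rates of both learning and forgetting" would appear as both sharper drops along the diagonal and larger positive entries in this summary.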