🤖 AI Summary
Pretrained language models face the dual challenges of overfitting and catastrophic forgetting when fine-tuned on target domains. This work systematically quantifies how data volume, model scale, and target-domain characteristics affect both phenomena, establishing a unified scaling law under which both forgetting and overfitting exhibit power-law behavior. The law is validated across multiple domains and model sizes through language-modeling fine-tuning experiments, cross-domain data mixing, and generalization-error analysis. Key contributions include: (i) demonstrating that injecting as little as 1% of the original pretraining data into the fine-tuning mixture substantially mitigates forgetting, improving fine-tuning stability and out-of-distribution generalization; and (ii) providing a predictive theoretical framework and a practical intervention strategy for balancing domain-specific adaptation against retained general-purpose capabilities. The resulting scaling law offers principled guidance for navigating the trade-off between specialization and robustness in downstream adaptation.
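The power-law behavior mentioned above can be illustrated with a small fitting sketch. The functional form `L(n) = a * n**(-b)` and the synthetic loss values below are assumptions for illustration only, not measurements from the paper; a real analysis would fit held-out losses from actual fine-tuning runs.

```python
import math

# Synthetic finetuning losses L(n) at increasing target-data sizes n,
# generated from an assumed power law L(n) = 2.0 * n**(-0.3).
# These are illustrative values, not results reported in the paper.
ns = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * n ** -0.3 for n in ns]

def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by ordinary least squares in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    intercept = my - slope * mx
    return math.exp(intercept), -slope  # (a, b)

a, b = fit_power_law(ns, losses)
print(f"a = {a:.3f}, b = {b:.3f}")  # recovers a ≈ 2.000, b ≈ 0.300
```

Fitting in log-log space turns the power law into a straight line, so ordinary least squares suffices; the recovered exponent `b` is the quantity a scaling-law analysis would compare across domains and model sizes.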
📝 Abstract
A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. We aim to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as 1% of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.
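The data-injection strategy from the abstract can be sketched as follows. The dataset names, the example-level mixing procedure, and the pool sizes here are hypothetical; in practice the pools would be streamed token sequences from the actual pretraining and target corpora.

```python
import random

# Hypothetical document pools standing in for real corpora.
target_data = [f"target_doc_{i}" for i in range(10_000)]
pretrain_data = [f"pretrain_doc_{i}" for i in range(1_000_000)]

def mix_datasets(target, pretrain, pretrain_fraction=0.01, seed=0):
    """Build a finetuning mixture in which `pretrain_fraction` of the
    examples (1% here, as in the abstract's takeaway) come from the
    pretraining set, with the rest drawn from the target domain."""
    rng = random.Random(seed)
    # Number of pretraining examples so they make up the desired
    # fraction of the combined mixture.
    n_pretrain = round(len(target) * pretrain_fraction / (1 - pretrain_fraction))
    mixture = list(target) + rng.sample(pretrain, n_pretrain)
    rng.shuffle(mixture)
    return mixture

mixture = mix_datasets(target_data, pretrain_data)
frac = sum(s.startswith("pretrain") for s in mixture) / len(mixture)
print(f"{len(mixture)} examples, {frac:.1%} from pretraining")
```

Shuffling after concatenation keeps the 1% of pretraining examples interleaved throughout training rather than clustered, which is what lets the replayed data continually counteract drift away from the pretraining distribution.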