🤖 AI Summary
This work addresses catastrophic forgetting in continual learning: how to preserve capabilities acquired during upstream training once a model undergoes downstream fine-tuning. The authors propose treating “robustness to subsequent fine-tuning” as a first-class objective of upstream training and study data scheduling strategies in a controlled three-stage pipeline (pretraining, post-training, and downstream fine-tuning). Their key finding is that early exposure, meaning mixing post-training data into pretraining, consistently outperforms post-training alone, yielding a better trade-off between upstream capability retention and downstream task performance at both the 135M and 1B model scales. Early exposure is complementary to forgetting-mitigation techniques such as replay and dropout, which provide additional gains when applied during post-training. In compute-matched experiments, the best allocation of target data between pretraining and post-training lies at neither extreme.
📝 Abstract
How can we train models whose post-trained capabilities survive subsequent fine-tuning? Rather than focusing on downstream interventions to mitigate forgetting of upstream capabilities, we study how upstream training choices - that is, the manner in which a capability is acquired - shape how robustly that capability is retained. We investigate this question in a controlled three-stage language-model pipeline: pretraining, post-training to acquire a target capability, and downstream fine-tuning on a new objective. Across 135M and 1B models, two post-training domains, and two downstream fine-tuning tasks, we find that immediate post-training performance does not reliably predict retention after subsequent fine-tuning: training recipes that look equivalent immediately after post-training can retain the target capability very differently after subsequent fine-tuning. In particular, early exposure - mixing post-training data into pretraining - consistently improves the frontier between retained upstream performance and downstream performance. In compute-matched experiments, where the target data must be allocated between pretraining and post-training, we find that the optimum lies at neither extreme. Together with our other empirical and theoretical findings, this supports the view that post-training drives immediate specialization while early exposure improves robustness to later forgetting. Replay and dropout, typically used to mitigate forgetting as it occurs during fine-tuning, provide complementary gains to early exposure when applied during post-training. Our findings suggest that robustness to subsequent fine-tuning should be treated as a first-class objective of upstream training, addressed preventatively through choices like early exposure rather than reactively during fine-tuning itself.
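To make the compute-matched setting concrete, the sketch below splits a fixed budget of target-domain tokens between early exposure (mixed into pretraining) and post-training. It is only an illustration under stated assumptions, not the authors' implementation: the names `Schedule`, `allocate`, and `early_fraction`, the token counts, and the choice to hold the generic pretraining budget fixed are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Token budget for the two upstream stages (hypothetical names)."""
    pretrain_generic: int   # generic pretraining tokens
    pretrain_target: int    # target-domain tokens mixed into pretraining
    posttrain_target: int   # target-domain tokens reserved for post-training

    @property
    def pretrain_mix_ratio(self) -> float:
        """Share of the pretraining stream drawn from the target domain."""
        total = self.pretrain_generic + self.pretrain_target
        return self.pretrain_target / total if total else 0.0


def allocate(generic_tokens: int, target_tokens: int, early_fraction: float) -> Schedule:
    """Split a fixed target-token budget between pretraining and post-training.

    early_fraction = 0.0 is post-training only; early_fraction = 1.0 places
    all target data in pretraining. The total target budget never changes,
    which is the assumption that makes the comparison compute-matched here.
    """
    if not 0.0 <= early_fraction <= 1.0:
        raise ValueError("early_fraction must be in [0, 1]")
    early = round(target_tokens * early_fraction)
    return Schedule(generic_tokens, early, target_tokens - early)


if __name__ == "__main__":
    # Sweep the allocation; each schedule would be trained and then scored
    # on retained upstream performance after downstream fine-tuning.
    for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
        s = allocate(generic_tokens=1_000_000, target_tokens=100_000,
                     early_fraction=fraction)
        print(fraction, s.pretrain_target, s.posttrain_target,
              f"{s.pretrain_mix_ratio:.3f}")
```

In a sweep like this, the abstract's claim corresponds to the best retention-versus-downstream trade-off appearing at an intermediate `early_fraction` rather than at 0.0 or 1.0; where exactly it falls is an empirical question the sketch does not answer.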