🤖 AI Summary
This work addresses the lack of an end-to-end theoretical understanding of how pretraining initialization influences feature reuse and learning during fine-tuning. By constructing an analytical framework for pretraining–fine-tuning dynamics in diagonal linear networks, the authors derive exact expressions for generalization error as a function of initialization scale and task statistics, revealing for the first time how initialization shapes the inductive bias of fine-tuning. The analysis identifies four distinct fine-tuning regimes governed primarily by the scale of initialization and demonstrates that a smaller initialization scale in earlier layers confers an advantage on tasks that rely on a subset of pretraining features. These theoretical predictions are validated through experiments with nonlinear networks on CIFAR-100, confirming that the distribution of initialization scales across layers significantly modulates fine-tuning generalization.
📝 Abstract
Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
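As a minimal illustrative sketch of the object the abstract studies (not the paper's exact setup or results), a diagonal linear network represents a linear predictor through an elementwise product of per-layer weights; here we use the common `u**2 - v**2` parameterization, where the initialization scale `alpha` (an assumed stand-in for the paper's initialization parameters) controls whether gradient descent behaves "lazily" (dense solutions) or learns sparse features:

```python
# Hedged sketch, not the paper's exact model: a two-layer diagonal linear
# network f(x) = <u**2 - v**2, x>, trained by full-batch gradient descent
# on a sparse regression task. `alpha` is the initialization scale: a small
# alpha biases gradient descent toward sparse, feature-selecting solutions,
# while a large alpha yields dense, "lazy" solutions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 15                           # overparameterized: d > n
w_star = np.zeros(d)
w_star[:2] = 1.0                        # sparse ground-truth weights
X = rng.standard_normal((n, d))
y = X @ w_star

def train(alpha, lr=5e-3, steps=20000):
    u = np.full(d, alpha)               # per-layer weights; the effective
    v = np.full(d, alpha)               # predictor is w = u**2 - v**2
    for _ in range(steps):
        r = X @ (u**2 - v**2) - y       # residuals
        g = X.T @ r / n                 # gradient of the loss w.r.t. w
        u -= lr * 2 * g * u             # chain rule through u**2
        v += lr * 2 * g * v             # chain rule through -v**2
    return u**2 - v**2

w_small = train(alpha=1e-3)             # small init: "rich" regime
w_large = train(alpha=1.0)              # large init: "lazy" regime
print("off-support mass, small init:", np.abs(w_small[2:]).sum())
print("off-support mass, large init:", np.abs(w_large[2:]).sum())
```

In runs of this sketch, both scales interpolate the training data, but the small-scale run tends to concentrate weight on the two true coordinates while the large-scale run spreads weight across all coordinates, echoing the abstract's point that initialization scale, not the data alone, determines which solution fine-tuning-style dynamics select.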