🤖 AI Summary
This work addresses the lack of an end-to-end theoretical understanding of how pretraining initialization influences feature reuse and learning during fine-tuning. By constructing an analytical framework for pretraining–fine-tuning dynamics in diagonal linear networks, the authors derive exact expressions for generalization error as a function of initialization scale and task statistics, revealing for the first time how initialization shapes the inductive bias of fine-tuning. The analysis identifies four distinct fine-tuning regimes governed primarily by the scale of initialization and demonstrates that a smaller initialization scale in earlier layers confers an advantage on tasks that rely on a subset of pretraining features. These theoretical predictions are validated through experiments with nonlinear networks on CIFAR-100, confirming that the distribution of initialization scales across layers significantly modulates fine-tuning generalization.
📝 Abstract
Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
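As a minimal illustrative sketch of the object the abstract studies (not the paper's exact setup or results), a diagonal linear network represents a linear predictor through an elementwise product of per-layer weights; here we use the common `u**2 - v**2` parameterization, where the initialization scale `alpha` (an assumed stand-in for the paper's initialization parameters) controls whether gradient descent behaves "lazily" (dense solutions) or learns sparse features:

```python
# Hedged sketch, not the paper's exact model: a two-layer diagonal linear
# network f(x) = <u**2 - v**2, x>, trained by full-batch gradient descent
# on a sparse regression task. `alpha` is the initialization scale: a small
# alpha biases gradient descent toward sparse, feature-selecting solutions,
# while a large alpha yields dense, "lazy" solutions.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 15                           # overparameterized: d > n
w_star = np.zeros(d)
w_star[:2] = 1.0                        # sparse ground-truth weights
X = rng.standard_normal((n, d))
y = X @ w_star

def train(alpha, lr=5e-3, steps=20000):
    u = np.full(d, alpha)               # per-layer weights; the effective
    v = np.full(d, alpha)               # predictor is w = u**2 - v**2
    for _ in range(steps):
        r = X @ (u**2 - v**2) - y       # residuals
        g = X.T @ r / n                 # gradient of the loss w.r.t. w
        u -= lr * 2 * g * u             # chain rule through u**2
        v += lr * 2 * g * v             # chain rule through -v**2
    return u**2 - v**2

w_small = train(alpha=1e-3)             # small init: "rich" regime
w_large = train(alpha=1.0)              # large init: "lazy" regime
print("off-support mass, small init:", np.abs(w_small[2:]).sum())
print("off-support mass, large init:", np.abs(w_large[2:]).sum())
```

In runs of this sketch, both scales interpolate the training data, but the small-scale run tends to concentrate weight on the two true coordinates while the large-scale run spreads weight across all coordinates, echoing the abstract's point that initialization scale, not the data alone, determines which solution fine-tuning-style dynamics select.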