🤖 AI Summary
This work establishes a theoretical framework for transfer learning with linear models in the large-model asymptotic regime, focusing on the conditions under which fine-tuning a pretrained model on a small number of target-domain samples improves generalization. Using high-dimensional asymptotic analysis and random matrix theory, the authors derive exact characterizations of the fine-tuning error for both linear regression and binary classification, and rigorously identify necessary and sufficient conditions for the fine-tuned model to strictly outperform the pretrained one. A key contribution is that the theoretical results depend only on the first- and second-order statistics of the target data distribution, eliminating any reliance on Gaussianity assumptions and thereby achieving broad distributional universality. Furthermore, the authors prove that stochastic gradient descent (SGD) fine-tuning yields substantial generalization gains when the source and target distributions are sufficiently aligned.
📝 Abstract
Transfer learning is an attractive framework for problems where there is a paucity of data, or where data collection is costly. One common approach to transfer learning is referred to as "model-based", and involves using a model that is pretrained on samples from a source distribution, which is easier to acquire, and then fine-tuning the model on a few samples from the target distribution. The hope is that, if the source and target distributions are "close", then the fine-tuned model will perform well on the target distribution even though it has seen only a few samples from it. In this work, we study the problem of transfer learning in linear models for both regression and binary classification. In particular, we consider the use of stochastic gradient descent (SGD) on a linear model initialized with pretrained weights and using a small training data set from the target distribution. In the asymptotic regime of large models, we provide an exact and rigorous analysis and relate the generalization errors (in regression) and classification errors (in binary classification) of the pretrained and fine-tuned models. In particular, we give conditions under which the fine-tuned model outperforms the pretrained one. An important aspect of our work is that all the results are "universal", in the sense that they depend only on the first and second order statistics of the target distribution. They thus extend well beyond the standard Gaussian assumptions commonly made in the literature.
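The model-based transfer-learning pipeline the abstract describes can be sketched numerically: pretrain a linear model on plentiful source data, then fine-tune it with SGD on a handful of target samples. The sketch below is illustrative only; the dimensions, noise level, step size, number of epochs, and the particular way the source and target tasks are made "close" (a small perturbation of the source weights) are all assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_src, n_tgt = 50, 2000, 20  # model dimension; source and target sample sizes (assumed)

# Source and target tasks: related but not identical linear teachers.
w_src = rng.normal(size=d) / np.sqrt(d)
w_tgt = w_src + 0.3 * rng.normal(size=d) / np.sqrt(d)  # "close" distributions (assumed perturbation)

def sample(w, n, noise=0.1):
    """Draw n noisy linear-regression samples from the teacher w."""
    X = rng.normal(size=(n, d))
    y = X @ w + noise * rng.normal(size=n)
    return X, y

# Pretrain on plentiful source data (ordinary least squares).
Xs, ys = sample(w_src, n_src)
w_pre = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Fine-tune with SGD on a few target samples, initialized at the pretrained weights.
Xt, yt = sample(w_tgt, n_tgt)
w = w_pre.copy()
lr = 0.01  # assumed step size
for epoch in range(100):
    for i in rng.permutation(n_tgt):
        g = (Xt[i] @ w - yt[i]) * Xt[i]  # squared-loss gradient on one sample
        w -= lr * g

# For isotropic inputs, the excess generalization error on the target
# distribution is ||w_hat - w_tgt||^2 (up to the irreducible noise variance).
err_pre = np.sum((w_pre - w_tgt) ** 2)
err_ft = np.sum((w - w_tgt) ** 2)
print(f"pretrained excess error: {err_pre:.4f}, fine-tuned excess error: {err_ft:.4f}")
```

Whether `err_ft` beats `err_pre` depends on exactly the quantities the paper analyzes: how close the two tasks are, the target sample size relative to the dimension, and the noise level. With few target samples, SGD can only correct the pretrained weights along the subspace spanned by the target inputs, which is the intuition behind the alignment conditions in the results above.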