🤖 AI Summary
This work studies how to choose the dimensionality of a pretrained representation so that downstream generalization is as good as possible when the amounts of unlabeled pretraining data and labeled downstream data differ. Pretraining is modeled as principal component analysis on unlabeled data, and downstream learning as linear regression on a separate labeled dataset. Using high-dimensional statistical analysis, the authors derive exact expressions for the training and generalization errors in the high-dimensional limit. These expressions relate representation dimensionality, unlabeled and labeled sample sizes, and task alignment, and they characterize the optimal representation size: maximally compressed (low-dimensional) representations are optimal when pretraining data are abundant but labeled downstream data are scarce, whereas higher-dimensional representations generalize better when pretraining data are limited. The study also quantifies how much unlabeled data is required to replace a single labeled sample, and reports similar behavior in autoencoders and pretrained LLMs.
📝 Abstract
Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalised as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for the training and generalisation errors, showing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterise the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, and we give conditions for when compression during pretraining improves generalisation.
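The two-stage setup described in the abstract (PCA on unlabelled data, then linear regression on a separate labelled set) can be explored numerically with a small simulation. The sketch below is only an illustration under stated assumptions and is not the paper's analytical derivation: the toy data model (a few high-variance input coordinates that carry the whole target), the helper names `make_toy_data`, `pca_directions`, `probe_errors` and `sweep`, and every parameter value are hypothetical choices made here. It uses NumPy only.

```python
import numpy as np


def make_toy_data(n_unlab, n_lab, n_test, d=50, n_spikes=3, spike_var=25.0,
                  noise_std=0.5, seed=0):
    """Toy data (an assumption, not the paper's setup): zero-mean Gaussian inputs
    whose first `n_spikes` coordinates have variance `spike_var` (all others 1),
    and a noisy linear target that depends only on those coordinates."""
    rng = np.random.default_rng(seed)
    scales = np.ones(d)
    scales[:n_spikes] = np.sqrt(spike_var)
    beta = np.zeros(d)
    beta[:n_spikes] = 1.0 / np.sqrt(n_spikes)

    def sample(n):
        X = rng.standard_normal((n, d)) * scales
        y = X @ beta + noise_std * rng.standard_normal(n)
        return X, y

    X_unlab, _ = sample(n_unlab)      # labels of the pretraining set are discarded
    X_lab, y_lab = sample(n_lab)
    X_test, y_test = sample(n_test)
    return X_unlab, X_lab, y_lab, X_test, y_test


def pca_directions(X_unlab):
    """'Pretraining': principal directions of the unlabelled data.
    Rows of Vt are ordered by decreasing explained variance."""
    mean = X_unlab.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_unlab - mean, full_matrices=False)
    return mean, Vt


def probe_errors(mean, Vt, k, X_lab, y_lab, X_test, y_test):
    """'Downstream': least-squares regression (no intercept) on the top-k PCA features.
    For k > n_lab, np.linalg.lstsq returns the minimum-norm interpolating solution."""
    P = Vt[:k].T                      # (d, k) projection onto the top-k directions
    Z_lab = (X_lab - mean) @ P
    Z_test = (X_test - mean) @ P
    w, *_ = np.linalg.lstsq(Z_lab, y_lab, rcond=None)
    train_mse = np.mean((Z_lab @ w - y_lab) ** 2)
    test_mse = np.mean((Z_test @ w - y_test) ** 2)
    return float(train_mse), float(test_mse)


def sweep(ks, **data_kwargs):
    """Return {k: (train MSE, test MSE)} for each representation size k."""
    X_unlab, X_lab, y_lab, X_test, y_test = make_toy_data(**data_kwargs)
    mean, Vt = pca_directions(X_unlab)
    return {k: probe_errors(mean, Vt, k, X_lab, y_lab, X_test, y_test) for k in ks}


if __name__ == "__main__":
    # Abundant unlabelled data, scarce labelled data.
    errs = sweep([1, 3, 10, 30, 50], n_unlab=5000, n_lab=30, n_test=5000)
    for k, (train_mse, test_mse) in errs.items():
        print(f"k={k:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

Varying `n_unlab` and `n_lab` in the call to `sweep` gives a rough empirical view of how the best-performing `k` moves with the two sample sizes. Any such observation is specific to this toy data model and should not be read as a reproduction of the paper's exact results.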