🤖 AI Summary
This dissertation addresses two fundamental challenges in foundation-model research: the opaque mechanism of representation learning and the diminishing returns of scaling. To resolve these, we propose the "contexture" theory, a unified characterization of representation learning in which an optimal representation captures the maximum information of the association between the input and a context variable, with the best generalization achieved when this association is of moderate strength. Within this unified mathematical framework, we argue that scaling bottlenecks stem primarily from context *quality*, not model size. We introduce two general-purpose objectives for learning the contexture, SVME and KISE, together with a strategy for mixing multiple contexts. Drawing on statistical learning theory, we prove learning bounds for representation learning and give a unified theoretical account of supervised, self-supervised, and generative pretraining, showing that mainstream pretraining objectives implicitly learn the contexture. Our work provides both theoretical foundations and practical guidelines for designing efficient, context-driven pretraining frameworks.
📝 Abstract
This dissertation establishes the contexture theory to mathematically characterize the mechanism of representation learning, or pretraining. Despite the remarkable empirical success of foundation models, it remains unclear what representations they learn and why these representations are useful for various downstream tasks. A scientific understanding of representation learning is critical, especially now that scaling up model size is producing diminishing returns and designing new pretraining methods is imperative for further progress. Prior work treated different representation learning methods quite differently; the contexture theory provides a unified framework for analyzing them. The central argument is that a representation is learned from the association between the input X and a context variable A. We prove that if an encoder captures the maximum information of this association, in which case we say that the encoder learns the contexture, then it is optimal on the class of tasks compatible with the context. We also show that a context is most useful when the association between X and A is neither too strong nor too weak. An important implication of the contexture theory is that increasing the model size alone will yield diminishing returns, and further advancement requires better contexts. We demonstrate that many pretraining objectives can learn the contexture, including supervised learning, self-supervised learning, and generative modeling. We then introduce two general objectives, SVME and KISE, for learning the contexture, and show how to mix multiple contexts together, an effortless way to create better contexts from existing ones. We further prove statistical learning bounds for representation learning. Finally, we discuss the effect of the distribution shift from pretraining data to the downstream task.
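To make the abstract's central claim concrete, here is a minimal toy sketch of "learning the contexture" when both X and A take finitely many values. It is an illustration under our own assumptions, not the dissertation's SVME or KISE objectives: we summarize the X–A association by the normalized joint matrix M[x, a] = P(x, a) / sqrt(P(x) P(a)) and take its top singular directions (beyond the trivial constant one) as a spectral representation of X.

```python
import numpy as np

# Toy sketch (our assumption, not the dissertation's method): for finite X
# and A, the X-A association is summarized by the normalized joint matrix
# M[x, a] = P(x, a) / sqrt(P(x) * P(a)), and the encoder is built from its
# top non-trivial left singular vectors.

rng = np.random.default_rng(0)

# A synthetic joint distribution over 6 input values x and 4 context values a.
joint = rng.random((6, 4))
joint /= joint.sum()                      # P(x, a)

p_x = joint.sum(axis=1)                   # marginal P(x)
p_a = joint.sum(axis=0)                   # marginal P(a)

# Normalized association matrix: its top singular value is always 1, with
# singular vectors proportional to sqrt(P(x)) and sqrt(P(a)) (the constant
# functions), so the informative structure lives in the remaining directions.
M = joint / np.sqrt(np.outer(p_x, p_a))

U, s, Vt = np.linalg.svd(M)

d = 2
encoder = U[:, 1:1 + d]   # drop the trivial top direction, keep the next d

print("singular values:", np.round(s, 3))
print("2-dim representation of each x:\n", np.round(encoder, 3))
```

A stronger X–A association shows up as larger non-trivial singular values, while an association that is too strong (A nearly determines X) or too weak (near independence, all non-trivial singular values near 0) leaves little usable spectral structure, which is one way to read the abstract's "neither too strong nor too weak" condition.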