In-context learning and Occam's razor

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the connection between in-context learning (ICL) in large language models and Occam's razor, focusing on how generalization improves through implicit control of model simplicity. Method: the authors establish theoretically that the next-token prediction loss is equivalent to prequential coding, a joint information-compression scheme that minimizes both the training error and the implicit complexity of the model learned from context. Using Transformer-based sequence modeling, information-theoretic analysis, and empirical validation, they show that this loss inherently balances fidelity and simplicity. Contribution/Results: the paper presents the first formal equivalence between ICL and a preference for model simplicity, providing a normative theoretical foundation for ICL and exposing limitations in how current methods implicitly regulate complexity. The analysis shows that the standard autoregressive loss naturally enforces parsimony, improving generalization. Code is publicly released for reproducibility.
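The fit-versus-complexity claim can be sketched with a standard prequential-coding identity (the notation here is illustrative, not taken verbatim from the paper): the total online code length decomposes into the training error of the final model plus an accumulated-regret term that plays the role of model complexity.

```latex
% Prequential code length of a dataset D = (x_1, \dots, x_n) under a
% learner that outputs \hat\theta(x_{<t}) after seeing the prefix x_{<t}:
L_{\mathrm{preq}}(D) = \sum_{t=1}^{n} -\log p_{\hat\theta(x_{<t})}(x_t)
% Adding and subtracting the log-likelihood of the final model \hat\theta(D)
% splits this into a fit term plus a complexity (regret) term:
L_{\mathrm{preq}}(D)
  = \underbrace{\sum_{t=1}^{n} -\log p_{\hat\theta(D)}(x_t)}_{\text{training error (fit)}}
  + \underbrace{\sum_{t=1}^{n} \log \frac{p_{\hat\theta(D)}(x_t)}{p_{\hat\theta(x_{<t})}(x_t)}}_{\text{complexity}}
```

Each summand of the left-hand side is exactly a next-token prediction loss, which is the sense in which minimizing that loss jointly minimizes fit and complexity.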

📝 Abstract
A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.
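The prequential (predict-then-update) coding idea can be illustrated with a toy sketch (a minimal Bernoulli example, not the paper's Transformer setup): each symbol is encoded with the model learned from the prefix, the model is then updated, and the total code length is the summed next-token log loss. A sequence with learnable structure compresses below the naive one-bit-per-symbol code.

```python
import math

def prequential_code_length(bits):
    """Prequential code length in bits for a binary sequence,
    using a Laplace-smoothed Bernoulli model learned online.
    Each symbol costs -log2 p(x_t | x_<t); the model is then
    updated with the observed symbol."""
    ones, total, length = 0, 0, 0.0
    for x in bits:
        p_one = (ones + 1) / (total + 2)  # Laplace rule of succession
        p = p_one if x == 1 else 1.0 - p_one
        length += -math.log2(p)  # next-token prediction loss in bits
        ones += x
        total += 1
    return length

data = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]  # mostly ones: learnable structure
preq = prequential_code_length(data)
naive = float(len(data))  # a fixed code spends 1 bit per symbol
print(f"prequential: {preq:.2f} bits vs naive: {naive:.2f} bits")
```

Because the online learner starts ignorant and pays for what it learns, the total automatically includes a complexity charge on top of the final model's fit, which is the trade-off the abstract describes.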
Problem

Research questions and friction points this paper is trying to address.

Machine Learning
Generalization
Occam's Razor
Innovation

Methods, ideas, or system contributions that make the work stand out.

Occam's Razor
In-Context Learning
Generalization Improvement