AI Summary
This work establishes the theoretical foundations of in-context learning (ICL) in linear-attention Transformers, addressing how pretraining sample size, task diversity, and context length jointly govern ICL performance.
Method: Leveraging high-dimensional asymptotic analysis and rigorous scaling-limit derivations, we develop the first asymptotic theory for ICL, characterizing its double-descent learning curve and a phase transition driven by task diversity: low diversity induces memorization, while high diversity enables generalization.
Contribution/Results: We derive tight necessary conditions for successful ICL in linear regression tasks and validate them empirically across linear-attention models and fully nonlinear Transformers. Our framework provides the first analytically tractable and empirically verifiable theory of ICL, unifying scaling laws across key architectural and statistical dimensions. The results rigorously connect pretraining data scale, contextual capacity, and task distribution geometry to emergent in-context inference behavior.
Abstract
Transformers have a remarkable ability to learn and execute tasks based on examples provided within the input itself, without explicit prior training. It has been argued that this capability, known as in-context learning (ICL), is a cornerstone of Transformers' success, yet questions about the necessary sample complexity, pretraining task diversity, and context length for successful ICL remain unresolved. Here, we provide a precise answer to these questions in an exactly solvable model of ICL of a linear regression task by linear attention. We derive sharp asymptotics for the learning curve in a phenomenologically rich scaling regime where the token dimension is taken to infinity; the context length and pretraining task diversity scale proportionally with the token dimension; and the number of pretraining examples scales quadratically. We demonstrate a double-descent learning curve with increasing pretraining examples, and uncover a phase transition in the model's behavior between low and high task diversity regimes: In the low diversity regime, the model tends toward memorization of training tasks, whereas in the high diversity regime, it achieves genuine in-context learning and generalization beyond the scope of pretrained tasks. These theoretical insights are empirically validated through experiments with both linear attention and full nonlinear Transformer architectures.
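To make the setup in the abstract concrete, here is a minimal, hypothetical sketch of in-context linear regression with a single linear-attention readout. The variable names (`Gamma`, `Xq`), the dimensions, and the identity-weight choice are illustrative assumptions, not the paper's trained solution or its scaling regime; the point is only to show the mechanism by which context pairs (x_i, y_i) let the model predict the label of a query without any weight update.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 2048  # token dimension, context length (n >> d for a clean signal)

# Sample one regression task w and an in-context dataset y_i = w . x_i.
w = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = X @ w

# Linear attention with the key/query product collapsed into one matrix Gamma:
# prediction(x_q) = (1/n) * sum_i y_i * x_i^T Gamma x_q.
# Gamma = I is an illustrative choice (one in-context gradient step on isotropic
# inputs), not the optimum derived in the paper.
Gamma = np.eye(d)

# Evaluate on a batch of held-out queries from the same task.
Xq = rng.standard_normal((100, d))
preds = (Xq @ Gamma @ X.T @ y) / n  # (100,) in-context predictions
truth = Xq @ w                      # ground-truth labels for the queries
corr = np.corrcoef(preds, truth)[0, 1]
```

Since (1/n) X^T y concentrates around w when the context is long, the attention output approximates w . x_q; `corr` between predictions and ground truth is close to 1 in this toy regime, and degrades as n shrinks toward d, which is the regime the paper's asymptotics characterize.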