Theory of Scaling Laws for In-Context Regression: Depth, Width, Context and Time

📅 2025-10-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the mechanism by which in-context learning (ICL) in deep linear self-attention models performs linear regression, systematically characterizing how performance depends on computational and statistical resources, including model width, depth, training steps, batch size, and context length. Method: the authors introduce an analytically tractable toy model of neural scaling laws and derive, in the high-dimensional asymptotic limit, exact expressions and power-law scalings for the ICL risk. Contribution/Results: the analysis is the first to show that depth improves ICL performance significantly only when the input covariance varies across contexts; on this basis, it predicts the compute-optimal Transformer architecture. The theory unifies asymptotic risk analysis across three canonical ICL settings (fixed, isotropic, and context-dependent covariance) and provides a rigorous framework for understanding the resource efficiency and architectural design of large models in ICL.
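To make the setup concrete: a single linear self-attention layer can implement one step of gradient descent (from zero) on the in-context least-squares loss. The sketch below is illustrative only, under that assumption; the paper's exact parameterization of the deep model may differ, and the preconditioner `gamma` here stands in for learned attention weights.

```python
import numpy as np

def one_step_gd_estimator(X, y, gamma=None):
    """In-context weight estimate from one gradient-descent step at w = 0
    on the least-squares loss over the context: w_hat = Gamma X^T y / N.
    Gamma is a (learnable) preconditioner; identity by default."""
    N, d = X.shape
    if gamma is None:
        gamma = np.eye(d)
    return gamma @ X.T @ y / N

def icl_predict(X, y, x_query, gamma=None):
    """Predict the label of x_query from the in-context pairs (X, y)."""
    return x_query @ one_step_gd_estimator(X, y, gamma)
```

For whitened, noiseless contexts (where `X.T @ X / N` equals the identity and `y = X @ w`), the one-step estimate recovers `w` exactly; with anisotropic covariates, its error depends on the mismatch between `gamma` and the inverse covariance, which is where depth and context length enter the scaling story.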

📝 Abstract
We study in-context learning (ICL) of linear regression in a deep linear self-attention model, characterizing how performance depends on various computational and statistical resources (width, depth, number of training steps, batch size, and data per context). In a joint limit where data dimension, context length, and residual stream width scale proportionally, we analyze the limiting asymptotics for three ICL settings: (1) isotropic covariates and tasks (ISO), (2) fixed and structured covariance (FS), and (3) randomly rotated and structured covariances (RRS). For the ISO and FS settings, we find that depth only aids ICL performance if context length is limited. By contrast, in the RRS setting where covariances change across contexts, increasing the depth leads to significant improvements in ICL, even at infinite context length. This provides a new solvable toy model of neural scaling laws that depends on both the width and depth of a transformer and predicts an optimal transformer shape as a function of compute. The toy model enables computation of exact asymptotics for the risk as well as derivation of power laws under source/capacity conditions for the ICL tasks.
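The three settings in the abstract differ only in how the covariate covariance is drawn per context. A minimal data-generation sketch, assuming Gaussian covariates and tasks and an illustrative power-law spectrum for `Sigma` (the paper's exact task distributions may differ):

```python
import numpy as np

def sample_context(setting, d, N, rng, Sigma=None):
    """Draw one context of N covariate/label pairs for the three ICL settings.

    ISO: x ~ N(0, I)                   isotropic covariance
    FS : x ~ N(0, Sigma)               Sigma fixed across contexts
    RRS: x ~ N(0, O Sigma O^T)         fresh random rotation O per context
    """
    if Sigma is None:
        Sigma = np.diag(1.0 / np.arange(1, d + 1))  # example structured spectrum
    if setting == "ISO":
        cov = np.eye(d)
    elif setting == "FS":
        cov = Sigma
    elif setting == "RRS":
        O, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
        cov = O @ Sigma @ O.T
    else:
        raise ValueError(setting)
    X = rng.multivariate_normal(np.zeros(d), cov, size=N)
    w = rng.standard_normal(d) / np.sqrt(d)  # random task vector
    y = X @ w
    return X, y
```

In the RRS setting every context presents a differently oriented covariance, so no single fixed preconditioner suffices; this is the regime where the paper finds that depth continues to help even at infinite context length.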
Problem

Research questions and friction points this paper is trying to address.

How do scaling laws govern in-context learning of linear regression?
How does ICL performance depend on transformer depth and width?
What transformer shape is optimal for a given compute budget?
Innovation

Methods, ideas, or system contributions that make the work stand out.

An analytically tractable deep linear self-attention model for studying ICL scaling laws
Depth improves performance only when covariances change across contexts
The toy model yields exact asymptotics for the risk and predicts the compute-optimal transformer shape
Blake Bordelon
Postdoctoral Fellow at Harvard CMSA
Machine Learning · Theoretical Neuroscience
Mary I. Letey
John A. Paulson School of Engineering and Applied Sciences, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University
Cengiz Pehlevan
Harvard University
Neural Networks · Theoretical Neuroscience · Machine Learning · Physics of Learning