🤖 AI Summary
This work investigates in-context learning (ICL) in two-layer Transformers with random initialization and a fixed first layer, applied to nonlinear regression. Under an asymptotic regime where context length, input dimension, hidden dimension, number of training tasks, and per-task samples jointly tend to infinity, we rigorously prove that the ICL behavior is exactly equivalent to a finite-order Hermite polynomial model. This equivalence reveals how the MLP’s second layer—through the interplay of nonlinear activation and overparameterization—enhances ICL performance and provides a unified explanation for the double-descent phenomenon. Extensive experiments validate the predictive accuracy and generalizability of this equivalent model across diverse activation functions, regularization schemes, and architectural scales. Our analysis establishes the first analytically tractable theoretical framework for nonlinear ICL in Transformers, offering novel insights into their implicit inductive biases.
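One standard tool behind equivalences of this kind is the Hermite expansion of the activation function; a sketch in our own notation (the paper's exact truncation degree and coefficients are not stated here):

```latex
% Probabilist's Hermite expansion of an activation \sigma,
% with g \sim \mathcal{N}(0,1) and orthogonality
% \mathbb{E}[\mathrm{He}_j(g)\,\mathrm{He}_k(g)] = k!\,\delta_{jk}:
\sigma(z) \;=\; \sum_{k \ge 0} c_k \, \mathrm{He}_k(z),
\qquad
c_k \;=\; \frac{1}{k!}\,
\mathbb{E}_{g \sim \mathcal{N}(0,1)}\!\big[\sigma(g)\,\mathrm{He}_k(g)\big].
```

Truncating this series at a finite degree yields a finite-order polynomial model of the sort the equivalence result refers to.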
📝 Abstract
We study the in-context learning (ICL) capabilities of pretrained Transformers in the setting of nonlinear regression. Specifically, we focus on a random Transformer with a nonlinear MLP head, where the first layer is randomly initialized and fixed while the second layer is trained. Furthermore, we consider an asymptotic regime where the context length, input dimension, hidden dimension, number of training tasks, and number of training samples jointly grow. In this setting, we show that the random Transformer is equivalent, in terms of ICL error, to a finite-degree Hermite polynomial model. This equivalence is validated through simulations across varying activation functions, context lengths, hidden layer widths (revealing a double-descent phenomenon), and regularization settings. Our results offer theoretical and empirical insights into when and how MLP layers enhance ICL, and how nonlinearity and over-parameterization influence model performance.
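The training setup described above (frozen random first layer, second layer fit in closed form) can be sketched as a plain random-features ridge regression. This is an illustrative toy, not the paper's Transformer or its ICL task: the dimensions, the `tanh` activation, the synthetic target, and the regularization strength `lam` are all hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m, n = 20, 200, 500   # input dim, hidden width, samples (illustrative sizes)
lam = 1e-2               # ridge regularization strength (assumed)

# Fixed random first layer: drawn once at initialization and never trained
W = rng.standard_normal((m, d)) / np.sqrt(d)

def features(X, act=np.tanh):
    """Frozen random first layer followed by a nonlinear activation."""
    return act(X @ W.T)

# Synthetic regression task (hypothetical target function plus noise)
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)

# Train only the second layer, here via ridge regression in closed form
Phi = features(X)                                        # (n, m)
a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

# Predict on fresh inputs
X_test = rng.standard_normal((100, d))
y_hat = features(X_test) @ a
```

Varying `m` relative to `n` in a sketch like this is also the simplest place to see the over-parameterization effects (e.g. double descent) that the abstract mentions.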