The Initialization Determines Whether In-Context Learning Is Gradient Descent

📅 2025-12-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether in-context learning (ICL) is theoretically equivalent to gradient descent (GD) under more realistic settings—specifically, with nonzero-mean Gaussian priors and multi-head linear self-attention (LSA). To bridge the gap between idealized assumptions and practice, we propose yq-LSA, a novel ICL architecture featuring trainable initial query embeddings, thereby relaxing the conventional zero-initialization constraint. We establish the first general theoretical equivalence between ICL and one-step GD, proving both the minimal number of attention heads required for exact ICL emulation and an upper bound on the approximation error. Empirically, yq-LSA significantly narrows the performance gap with one-step GD and consistently improves semantic similarity accuracy on real large language models. Our core contributions are threefold: (i) incorporation of nonzero-mean priors for realistic task distribution modeling; (ii) introduction of learnable initialization for enhanced expressivity; and (iii) rigorous theoretical characterization that unifies mechanistic understanding of ICL with its empirical efficacy.

📝 Abstract
In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions: zero-mean Gaussian priors and zero initialization for GD. Subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference that is akin to, but distinct from, GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when incorporating nonzero Gaussian prior means in linear regression formulations of ICL. We first extend the multi-head LSA embedding matrix by introducing an initial estimate of the query's label, referred to as the initial guess. We prove an upper bound on the number of heads needed in the ICL linear regression setup. Our experiments confirm this result and further show that a performance gap between one-step GD and multi-head LSA persists. To address this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and validate them experimentally on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the linear regression case, we augment widespread LLMs with initial-guess capabilities and show that their performance improves on a semantic similarity task.
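As a minimal illustration of the setting the abstract describes, the sketch below runs one GD step on an in-context linear regression task whose weights come from a nonzero-mean prior, and compares the conventional zero initialization with starting from the prior mean. All constants (`mu`, noise scale, learning rate, dimensions) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, lr = 4, 32, 0.1

# Task weights drawn from a *nonzero-mean* Gaussian prior; mu and the
# noise scale here are illustrative choices, not the paper's setup.
mu = np.ones(d)
w_task = mu + 0.1 * rng.standard_normal(d)

X = rng.standard_normal((n, d))   # in-context inputs
y = X @ w_task                    # in-context labels

def one_gd_step(w0):
    """One gradient step on the in-context squared loss, from estimate w0."""
    grad = -(X.T @ (y - X @ w0)) / n
    return w0 - lr * grad

w_from_zero = one_gd_step(np.zeros(d))   # conventional zero initialization
w_from_prior = one_gd_step(mu)           # nonzero initial guess (prior mean)

err_zero = np.linalg.norm(w_from_zero - w_task)
err_prior = np.linalg.norm(w_from_prior - w_task)
print(err_prior < err_zero)  # starting at the prior mean helps in this setup
```

Because the task weights concentrate around the prior mean, one GD step from that mean lands much closer to the true weights than one step from zero, which is the gap a learned initial guess is meant to close.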
Problem

Research questions and friction points this paper is trying to address.

Investigates how multi-head linear self-attention approximates gradient descent in realistic conditions.
Addresses the performance gap between one-step gradient descent and multi-head linear self-attention.
Proposes and validates a generalized self-attention model with trainable initial guess to bridge ICL and GD.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introducing trainable initial guess in self-attention
Extending multi-head linear self-attention with nonzero-mean priors
Bridging in-context learning and gradient descent theoretically
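To make the LSA-to-GD bridge concrete, the following sketch hand-constructs a single linear self-attention head whose query token carries an initial-guess slot, and checks that its output matches one explicit GD step. The weights follow the standard GD-emulating construction from prior work, not the paper's learned yq-LSA parameters, and all dimensions and the learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lr = 4, 32, 0.5

w_task = rng.standard_normal(d)
X = rng.standard_normal((n, d))           # context inputs
y = X @ w_task                            # context labels
x_q = rng.standard_normal(d)              # query input

y_q0 = 0.0  # query's label slot: the "initial guess" (trainable in yq-LSA)

# Tokens e_i = (x_i, y_i); the query token is (x_q, y_q0).
E_ctx = np.hstack([X, y[:, None]])        # shape (n, d+1)
e_q = np.concatenate([x_q, [y_q0]])

# Hand-set weights of one linear self-attention head (a known
# GD-emulating construction; the paper's heads are learned, not hand-set).
W_KQ = np.zeros((d + 1, d + 1))
W_KQ[:d, :d] = np.eye(d)                  # keys/queries read the x-part
W_PV = np.zeros((d + 1, d + 1))
W_PV[d, d] = lr                           # values write lr * y to label slot

# Linear attention update of the query token (no softmax).
scores = E_ctx @ W_KQ @ e_q               # x_i . x_q for each context token
update = (W_PV @ E_ctx.T) @ scores / n
lsa_pred = (e_q + update)[d]              # read off the label slot

# The same prediction via one explicit GD step from w0 = 0.
w1 = lr / n * X.T @ y
gd_pred = y_q0 + x_q @ w1
print(np.isclose(lsa_pred, gd_pred))      # → True: the head emulates one step
```

The label slot of the query token starts at `y_q0` and is shifted by the attention update, so making `y_q0` trainable shifts where the emulated GD step begins, which is the mechanism the trainable initial guess exploits.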