🤖 AI Summary
This work investigates whether in-context learning (ICL) is theoretically equivalent to gradient descent (GD) under more realistic settings, specifically with nonzero-mean Gaussian priors and multi-head linear self-attention (LSA). To bridge the gap between idealized assumptions and practice, we propose yq-LSA, an ICL architecture with a trainable initial query estimate (the "initial guess"), relaxing the conventional zero-initialization constraint. We establish a theoretical connection between ICL and one-step GD in this setting, proving an upper bound on the number of attention heads multi-head LSA needs to emulate one-step GD. Empirically, yq-LSA narrows the performance gap with one-step GD on linear regression tasks, and large language models augmented with initial-guess capabilities improve on a semantic similarity task. Our core contributions are threefold: (i) incorporation of nonzero-mean priors for realistic task distribution modeling; (ii) introduction of a learnable initial guess for enhanced expressivity; and (iii) theoretical results that extend the bridge between the mechanistic understanding of ICL and its empirical efficacy.
📝 Abstract
In-context learning (ICL) in large language models (LLMs) is a striking phenomenon, yet its underlying mechanisms remain only partially understood. Previous work connects linear self-attention (LSA) to gradient descent (GD), but this connection has primarily been established under simplified conditions: zero-mean Gaussian priors and zero initialization for GD. Subsequent studies have challenged this simplified view by highlighting its overly restrictive assumptions, demonstrating instead that under conditions such as multi-layer or nonlinear attention, self-attention performs optimization-like inference that is akin to, but distinct from, GD. We investigate how multi-head LSA approximates GD under more realistic conditions, specifically when linear regression formulations of ICL incorporate nonzero Gaussian prior means. We first extend the multi-head LSA embedding matrix by introducing an initial estimate for the query, referred to as the initial guess. We prove an upper bound on the number of heads needed in the ICL linear regression setup. Our experiments confirm this result, and we further observe that a performance gap between one-step GD and multi-head LSA persists. To close this gap, we introduce yq-LSA, a simple generalization of single-head LSA with a trainable initial guess yq. We theoretically establish the capabilities of yq-LSA and validate them experimentally on linear regression tasks, thereby extending the theory that bridges ICL and GD. Finally, inspired by our findings in the linear regression case, we augment widely used LLMs with initial-guess capabilities and show that their performance improves on a semantic similarity task.
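The linear-regression framing above can be made concrete with a small sketch. This is our own illustration, not the paper's construction: it draws a task weight vector from a nonzero-mean Gaussian prior and compares one-step GD predictions when starting from zero (the classical simplified setting) against starting from the prior mean, which plays the role of an "initial guess". The names `one_step_gd_predict`, the step size `eta`, and the choice `mu = 1` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20

# Task weights drawn from a *nonzero-mean* Gaussian prior N(mu, I).
mu = np.ones(d)
w_true = mu + rng.normal(size=d)

# In-context examples (X, y) and a query point x_q (noiseless for simplicity).
X = rng.normal(size=(n, d))
y = X @ w_true
x_q = rng.normal(size=d)

def one_step_gd_predict(X, y, x_q, w0, eta=0.1):
    """Predict the query label after a single GD step on the
    least-squares loss (1/2n)||Xw - y||^2, starting from w0."""
    grad = X.T @ (X @ w0 - y) / len(y)  # gradient at w0
    w1 = w0 - eta * grad
    return x_q @ w1

# Zero initialization: the simplified setting in prior equivalence results.
pred_zero = one_step_gd_predict(X, y, x_q, np.zeros(d))
# Prior-mean initialization: the analogue of a learned initial guess,
# which typically starts closer to w_true when the prior mean is nonzero.
pred_mu = one_step_gd_predict(X, y, x_q, mu)

err_zero = abs(pred_zero - x_q @ w_true)
err_mu = abs(pred_mu - x_q @ w_true)
```

The point of the sketch is only that one-step GD is initialization-dependent: with a nonzero-mean prior, a zero start is systematically biased, which is the gap a trainable initial guess is meant to absorb.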