🤖 AI Summary
This work investigates the in-context learning (ICL) mechanism of Transformers for underdetermined inverse linear regression (ILR), asking how they implicitly infer high-dimensional unknown weight vectors from limited contextual examples. The authors introduce a linear Transformer model and analyze it through the lens of implicit regularization, supported by comprehensive numerical experiments. Key contributions: (i) evidence that Transformers can adaptively learn task-specific prior distributions across tasks and perform implicit regularization, departing fundamentally from explicit regularization paradigms such as ridge regression; (ii) identification of a necessary condition for successful learning, namely that the task dimension be smaller than the context length; and (iii) numerical verification that the estimation error scales linearly with the noise level, the dimension-to-context ratio, and the condition number of the input matrix, with the Transformer consistently outperforming classical regularized estimators. Together, these results offer a new analytical perspective on ICL for ill-posed inverse problems.
📝 Abstract
Transformers have shown a remarkable ability for in-context learning (ICL), making predictions based on contextual examples. However, while theoretical analyses have explored this prediction capability, the nature of the inferred context and its utility for downstream predictions remain open questions. This paper addresses these questions by examining ICL for inverse linear regression (ILR), where context inference can be characterized as unsupervised learning of the underlying weight vectors. Focusing on the challenging scenario of rank-deficient inverse problems, where the context length is smaller than the number of unknowns in the weight vector and regularization is therefore necessary, we introduce a linear transformer to learn the inverse mapping from contextual examples to the underlying weight vector. Our findings reveal that the transformer implicitly learns both a prior distribution and an effective regularization strategy, outperforming ridge regression and other traditional regularization methods. A key insight is that successful learning requires the task dimensionality to be low relative to the context length. Furthermore, we numerically verify that the error of the transformer estimator scales linearly with the noise level, the ratio of task dimension to context length, and the condition number of the input data. These results not only demonstrate the potential of transformers for solving ill-posed inverse problems, but also provide a new perspective on the knowledge-extraction mechanism within transformers.
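The abstract's contrast between prior-aware implicit regularization and plain ridge regression can be illustrated with a small NumPy sketch. Everything here is an illustrative assumption rather than the paper's construction: tasks share a low-dimensional subspace `B` (task dimension `k` smaller than the context length `n`, which is in turn smaller than the ambient dimension `d`), and `prior_aware` is an oracle estimator that knows `B`, standing in for the prior the transformer is claimed to learn implicitly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k, sigma = 64, 16, 4, 0.05   # ambient dim, context length, task dim, noise level

# Shared prior: true weight vectors lie in a fixed k-dimensional subspace span(B).
B = np.linalg.qr(rng.standard_normal((d, k)))[0]

def make_task():
    w = B @ rng.standard_normal(k)                 # true weights in span(B)
    X = rng.standard_normal((n, d))                # n context inputs, n < d (rank-deficient)
    y = X @ w + sigma * rng.standard_normal(n)     # noisy context labels
    return X, y, w

def ridge(X, y, lam=1e-2):
    # Ridge solution of the underdetermined system, written in its n x n dual form.
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

def prior_aware(X, y):
    # Oracle using the prior subspace: fit z in y ~ (X B) z, then map back w = B z.
    # The reduced system is n x k with k < n, so it is well-posed.
    z, *_ = np.linalg.lstsq(X @ B, y, rcond=None)
    return B @ z

errs = np.array([[np.linalg.norm(f(X, y) - w) / np.linalg.norm(w)
                  for f in (ridge, prior_aware)]
                 for X, y, w in (make_task() for _ in range(200))])
print("mean relative error  ridge:", errs[:, 0].mean(),
      " prior-aware:", errs[:, 1].mean())
```

With n < d, ridge cannot see the component of w outside the row space of X, so its relative error stays large, while the prior-aware estimator recovers w to near the noise floor; this mirrors the claimed advantage of learning the task prior over explicit regularization.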