🤖 AI Summary
This work investigates the in-context learning (ICL) mechanism of Transformers for modeling dynamics-driven Markov functions. Method: For a single-layer linear self-attention (LSA) model, we derive the first closed-form expression for its global optimum and prove that parameter recovery is NP-hard, revealing a fundamental limitation in representing structured dynamical functions. We further show that multi-layer architectures are equivalent to preconditioned gradient descent over multiple objectives, thereby overcoming the expressivity bottleneck of single-layer models. Contribution/Results: Through loss landscape analysis, complexity-theoretic proofs, and numerical experiments on simplified Transformers, we systematically characterize the theoretical limits of model expressivity under ICL. Our results rigorously establish that multi-layer architectures enable effective multi-objective optimization, significantly enhancing representational capacity beyond single-layer LSA.
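To make the "preconditioned gradient descent" interpretation concrete, here is a minimal numerical sketch of the update rule each layer is said to emulate, w ← w − G∇L(w) with a fixed preconditioner G on a square loss. The data model, preconditioner choice, and step count are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 50
X = rng.normal(size=(n, d))          # in-context inputs
w_star = rng.normal(size=d)          # ground-truth linear map
y = X @ w_star                       # noiseless labels

# Preconditioner: regularized inverse of the empirical covariance
# (an illustrative choice; each LSA layer would encode one such step).
G = np.linalg.inv(X.T @ X / n + 0.1 * np.eye(d))

w = np.zeros(d)
for _ in range(20):
    grad = X.T @ (X @ w - y) / n     # gradient of the square loss
    w = w - G @ grad                 # preconditioned gradient step

print(np.allclose(w, w_star, atol=1e-2))
```

With this preconditioner the error contracts by roughly σ_min/(σ_min + 0.1) per step, so a stack of a few such layers/iterations already recovers `w_star` to high accuracy.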
📝 Abstract
Transformer architectures can solve unseen tasks from input-output pairs supplied in a prompt, a capability known as in-context learning (ICL). Existing theoretical studies of ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal the underlying optimization behaviors. Specifically, we (1) provide a closed-form expression for the global minimizer (in an enlarged parameter space) of a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
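The structured ICL setup described above can be sketched numerically: generate a Markovian trajectory x_{t+1} = A x_t, pack the (state, next-state) pairs plus a query token into a prompt, and run one softmax-free (linear) self-attention layer over it. The token layout, dimensions, and random parameterization below are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 32                               # state dimension, prompt length

# Markovian data: a hidden linear dynamics x_{t+1} = A x_t
A = rng.normal(size=(d, d)) / np.sqrt(d)
xs = [rng.normal(size=d)]
for _ in range(n):
    xs.append(A @ xs[-1])
X, Y = np.stack(xs[:-1]), np.stack(xs[1:])  # (input, next-state) pairs

# Prompt: n labeled tokens (x_i, y_i) plus a query token (x_query, 0)
x_query = rng.normal(size=d)
Z = np.concatenate([np.hstack([X, Y]),
                    np.hstack([x_query[None], np.zeros((1, d))])], axis=0)

# Single-layer LSA: attention without softmax, with a merged
# key-query matrix W and a value/projection matrix P (randomly
# initialized here purely for shape illustration).
W = rng.normal(size=(2 * d, 2 * d)) * 0.1
P = rng.normal(size=(2 * d, 2 * d)) * 0.1
attn = (Z @ W @ Z.T) / n                   # linear attention scores
out = attn @ Z @ P.T                       # LSA output for every token
pred = out[-1, d:]                         # prediction read off the query token
print(pred.shape)                          # (4,)
```

Training W and P on many such prompts, and asking whether they can realize the closed-form global minimizer, is the question the parameter-recovery NP-hardness result addresses.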