Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates the in-context learning (ICL) mechanism of Transformers for modeling Markovian dynamical functions. Method: For a single-layer linear self-attention (LSA) model, we derive the first closed-form expression for its global optimum and prove that recovering the corresponding parameters is NP-hard, revealing a fundamental limitation in representing structured dynamical functions. We further show that multilayer architectures are equivalent to preconditioned gradient descent over multiple objectives, thereby overcoming the expressivity bottleneck of single-layer models. Contribution/Results: Through loss landscape analysis, complexity-theoretic proofs, and numerical experiments on simplified Transformers, we systematically characterize the theoretical limits of model expressivity under ICL. Our results rigorously establish that multilayer architectures enable effective multi-objective optimization, significantly enhancing representational capacity beyond single-layer LSA.

๐Ÿ“ Abstract
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
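The single-layer LSA setup the abstract analyzes can be sketched numerically. The parameterization below is a common illustrative one (a merged value matrix `P` and key-query matrix `Q`, with the prompt stacked as columns and a zero-padded query token); it is an assumption for illustration, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20  # input dimension, number of in-context examples

# Hypothetical ICL prompt: columns are (x_i, y_i) pairs plus a query (x_q, 0).
w_star = rng.standard_normal(d)   # the unseen task vector
X = rng.standard_normal((d, n))
y = w_star @ X
x_q = rng.standard_normal(d)
Z = np.zeros((d + 1, n + 1))
Z[:d, :n], Z[d, :n], Z[:d, n] = X, y, x_q

# Single-layer linear self-attention (no softmax): Z_out = Z + P Z (Z^T Q Z) / n,
# with P and Q the merged value and key-query matrices (assumed weights).
P = np.zeros((d + 1, d + 1)); P[d, d] = 1.0
Q = np.zeros((d + 1, d + 1)); Q[:d, :d] = np.eye(d) / d
Z_out = Z + P @ Z @ (Z.T @ Q @ Z) / n

# The (d, query) entry holds the in-context prediction; for these weights it
# equals one preconditioned gradient step from w = 0, i.e. (X y / (n d)) . x_q.
pred = Z_out[d, n]
print(pred, w_star @ x_q)
```

With these particular weights the prediction reduces to a single gradient-descent step on the in-context squared loss, which is the kind of mechanism the loss-landscape analysis makes precise.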
Problem

Research questions and friction points this paper is trying to address.

How do transformers express in-context learning when modeling Markovian dynamical functions?
What does the loss landscape of this structured ICL setup reveal about the underlying optimization behavior?
Is it computationally tractable to recover transformer parameters that realize the optimal solution?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-form expression for the global minimizer of a single-layer linear self-attention (LSA) model
NP-hardness proof for recovering transformer parameters that realize the optimal solution
Interpretation of multilayer LSA as preconditioned gradient descent over multiple objectives
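The multilayer interpretation above can be illustrated with a plain numerical sketch: each LSA layer is modeled as one preconditioned gradient-descent step on a squared loss. The specific preconditioner and layer count below are illustrative assumptions, and the paper's result covers layer-wise objectives beyond the square loss:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, layers = 4, 40, 5

w_star = rng.standard_normal(d)   # ground-truth linear map
X = rng.standard_normal((d, n))
y = w_star @ X                    # noiseless in-context targets

# Hypothetical preconditioner: a damped inverse of the data covariance.
precond = np.linalg.inv(X @ X.T / n + 0.1 * np.eye(d))

w = np.zeros(d)
losses = []
for _ in range(layers):            # each LSA layer ~ one preconditioned GD step
    grad = X @ (w @ X - y) / n     # gradient of 0.5 * mean squared error
    w = w - precond @ grad
    losses.append(0.5 * np.mean((w @ X - y) ** 2))

print(losses)
```

Because the preconditioned step contracts every error mode, the loss shrinks monotonically with depth, matching the intuition that stacking layers buys extra optimization steps (and hence expressivity) that a single layer cannot realize.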