Optimality and NP-Hardness of Transformers in Learning Markovian Dynamical Functions

📅 2025-10-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work investigates the in-context learning (ICL) mechanism of Transformers for modeling Markovian dynamical functions. Method: For a single-layer linear self-attention (LSA) model, we derive the first closed-form expression for its global optimum and prove that recovering the corresponding parameters is NP-hard, revealing a fundamental limitation in representing structured dynamical functions. We further show that multilayer architectures are equivalent to preconditioned gradient descent over multiple objectives, thereby overcoming the expressivity bottleneck of single-layer models. Contribution/Results: Through loss landscape analysis, complexity-theoretic proofs, and numerical experiments on simplified Transformers, we systematically characterize the theoretical limits of model expressivity under ICL. Our results rigorously establish that multilayer architectures enable effective multi-objective optimization, significantly enhancing representational capacity beyond single-layer LSA.

๐Ÿ“ Abstract
Transformer architectures can solve unseen tasks based on input-output pairs in a given prompt due to in-context learning (ICL). Existing theoretical studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs. To understand how transformers express ICL when modeling dynamics-driven functions, we investigate Markovian function learning through a structured ICL setup, where we characterize the loss landscape to reveal underlying optimization behaviors. Specifically, we (1) provide the closed-form expression of the global minimizer (in an enlarged parameter space) for a single-layer linear self-attention (LSA) model; (2) prove that recovering transformer parameters that realize the optimal solution is NP-hard in general, revealing a fundamental limitation of one-layer LSA in representing structured dynamical functions; and (3) supply a novel interpretation of a multilayer LSA as performing preconditioned gradient descent to optimize multiple objectives beyond the square loss. These theoretical results are numerically validated using simplified transformers.
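The single-layer LSA setup the abstract analyzes can be sketched numerically. The parameterization below is a common illustrative one (a merged value matrix `P` and key-query matrix `Q`, with the prompt stacked as columns and a zero-padded query token); it is an assumption for illustration, not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 20  # input dimension, number of in-context examples

# Hypothetical ICL prompt: columns are (x_i, y_i) pairs plus a query (x_q, 0).
w_star = rng.standard_normal(d)   # the unseen task vector
X = rng.standard_normal((d, n))
y = w_star @ X
x_q = rng.standard_normal(d)
Z = np.zeros((d + 1, n + 1))
Z[:d, :n], Z[d, :n], Z[:d, n] = X, y, x_q

# Single-layer linear self-attention (no softmax): Z_out = Z + P Z (Z^T Q Z) / n,
# with P and Q the merged value and key-query matrices (assumed weights).
P = np.zeros((d + 1, d + 1)); P[d, d] = 1.0
Q = np.zeros((d + 1, d + 1)); Q[:d, :d] = np.eye(d) / d
Z_out = Z + P @ Z @ (Z.T @ Q @ Z) / n

# The (d, query) entry holds the in-context prediction; for these weights it
# equals one preconditioned gradient step from w = 0, i.e. (X y / (n d)) . x_q.
pred = Z_out[d, n]
print(pred, w_star @ x_q)
```

With these particular weights the prediction reduces to a single gradient-descent step on the in-context squared loss, which is the kind of mechanism the loss-landscape analysis makes precise.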
Problem

Research questions and friction points this paper is trying to address.

How do transformers express in-context learning when modeling Markovian dynamical functions?
What does the loss landscape of this structured ICL setup reveal about the underlying optimization behavior?
Is it computationally tractable to recover transformer parameters that realize the optimal solution?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Closed-form expression for the global minimizer of a single-layer linear self-attention (LSA) model
NP-hardness proof for recovering transformer parameters that realize the optimal solution
Interpretation of multilayer LSA as preconditioned gradient descent over multiple objectives
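The multilayer interpretation above can be illustrated with a plain numerical sketch: each LSA layer is modeled as one preconditioned gradient-descent step on a squared loss. The specific preconditioner and layer count below are illustrative assumptions, and the paper's result covers layer-wise objectives beyond the square loss:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, layers = 4, 40, 5

w_star = rng.standard_normal(d)   # ground-truth linear map
X = rng.standard_normal((d, n))
y = w_star @ X                    # noiseless in-context targets

# Hypothetical preconditioner: a damped inverse of the data covariance.
precond = np.linalg.inv(X @ X.T / n + 0.1 * np.eye(d))

w = np.zeros(d)
losses = []
for _ in range(layers):            # each LSA layer ~ one preconditioned GD step
    grad = X @ (w @ X - y) / n     # gradient of 0.5 * mean squared error
    w = w - precond @ grad
    losses.append(0.5 * np.mean((w @ X - y) ** 2))

print(losses)
```

Because the preconditioned step contracts every error mode, the loss shrinks monotonically with depth, matching the intuition that stacking layers buys extra optimization steps (and hence expressivity) that a single layer cannot realize.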