Learning Linear Attention in Polynomial Time

📅 2024-10-14
🏛️ arXiv.org
📈 Citations: 3
Influential: 1
🤖 AI Summary
This work resolves the open problem of learnability for single-layer linear-attention Transformers, establishing their strong, agnostic PAC-learnability: provably efficient polynomial-time learning without distributional assumptions. Methodologically, it reduces the learning problem to linear prediction in a suitably defined Reproducing Kernel Hilbert Space (RKHS), shows how any such linear predictor can be converted back into a multi-head linear-attention network, and uses a symmetry analysis of empirical risk minimization to identify datasets on which every minimizer generalizes correctly. Theoretically, it bridges computational expressivity and statistical learnability: computations expressible via linear attention, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories, are thereby polynomial-time learnable. Empirical validation confirms accurate convergence on learning random linear-attention networks, key-value associations, and finite-automaton execution.

📝 Abstract
Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machines (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key-value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
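The core reduction can be illustrated numerically. Below is a minimal NumPy sketch (not code from the paper; the weight names `WQ`, `WK`, `WV` and this particular feature-map parameterization are illustrative) showing that a single linear-attention head, which scores queries against keys without a softmax, computes exactly an ordinary linear map applied to a degree-3 tensor feature of the input sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 6                       # embedding dimension, sequence length
X = rng.standard_normal((n, d))   # token embeddings, one row per position
WQ, WK, WV = (rng.standard_normal((d, d)) for _ in range(3))

# (1) Linear attention computed directly:
#     y_t = sum_i <WQ x_t, WK x_i> * (WV x_i)   (no softmax: scores enter linearly)
Q, K, V = X @ WQ.T, X @ WK.T, X @ WV.T
Y_direct = (Q @ K.T) @ V

# (2) The same map as a linear predictor in an expanded feature space:
#     phi_t = x_t (outer) sum_i x_i x_i^T  -- a fixed degree-3 feature of the input --
#     and all head parameters collapse into one weight tensor W acting linearly on phi_t.
M = X.T @ X                                   # sum_i x_i x_i^T, shape (d, d)
Phi = np.einsum('tb,cd->tbcd', X, M)          # phi_t[b,c,d] = x_t[b] * M[c,d]
W = np.einsum('bc,ad->abcd', WQ.T @ WK, WV)   # W[a,b,c,d] = (WQ^T WK)[b,c] * WV[a,d]
Y_linear = np.einsum('abcd,tbcd->ta', W, Phi)

assert np.allclose(Y_direct, Y_linear)        # the two computations agree exactly
```

Because the feature map `Phi` is fixed and the parameters enter only through the single tensor `W`, learning the attention weights becomes linear regression over these expanded features, which is the step that makes polynomial-time agnostic PAC learning possible.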
Problem

Research questions and friction points this paper is trying to address.

Studying the learnability of linear-attention Transformers from observational data
Providing polynomial-time learnability results for single-layer Transformers
Identifying computations expressible via linear attention for efficient learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear attention enables polynomial-time learnability
Learning linear transformers via linear predictors in RKHS
Efficiently learnable computations include associative memories and automata