Universal Approximation with Softmax Attention

📅 2025-04-22
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper investigates whether the self-attention mechanism—relying solely on linear transformations and the Softmax nonlinearity—can serve as a universal approximator for continuous sequence-to-sequence functions over compact domains. Method: Through interpolation analysis and explicit construction of attention weights, the authors establish approximation guarantees by proving that self-attention can uniformly approximate generalized ReLU functions—a key technical step enabling universal approximation without auxiliary components. Contribution/Results: The work proves that either (i) two-layer multi-head self-attention or (ii) single-layer self-attention followed by Softmax suffices to approximate any continuous sequence mapping on a compact domain to arbitrary precision—without requiring conventional feed-forward networks. This constitutes the first rigorous demonstration of universal approximation capability in pure attention architectures. The result further extends to in-context approximation of statistical models. By decoupling self-attention from mandatory feed-forward layers, the study challenges the standard Transformer paradigm and provides a theoretical foundation for lightweight, interpretable, feed-forward-free attention models.

📝 Abstract
We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our main technique is a new interpolation-based method for analyzing attention's internal mechanism. This leads to our key insight: self-attention is able to approximate a generalized version of ReLU to arbitrary precision, and hence subsumes many known universal approximators. Building on these, we show that two-layer multi-head attention alone suffices as a sequence-to-sequence universal approximator. In contrast, prior works rely on feed-forward networks to establish universal approximation in Transformers. Furthermore, we extend our techniques to show that (softmax-)attention-only layers are capable of approximating various statistical models in-context. We believe these techniques hold independent interest.
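For concreteness, the object studied here is attention built only from linear maps and a softmax. A minimal single-head sketch (shapes and names are illustrative, not the paper's construction):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: (seq_len, d). Every learnable piece is a plain linear map;
    # softmax is the only nonlinearity, matching the paper's setting.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)
print(Y.shape)  # (4, 8): a sequence-to-sequence map
```

The paper's result is that stacking two such layers (multi-head), or following one layer with a softmax, already gives universal approximation, with no feed-forward sublayer needed.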
Problem

Research questions and friction points this paper is trying to address.

Proving self-attention's universal approximation capability for sequence functions
Analyzing attention's mechanism via interpolation to approximate ReLU
Demonstrating multi-head attention suffices as universal approximator without feed-forward networks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Attention with only linear transformations and softmax achieves universal approximation
Interpolation-based method analyzes attention's internal mechanism
Self-attention approximates a generalized ReLU to arbitrary precision
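A toy numeric illustration (not the paper's construction): as the score scale grows, softmax concentrates on the largest score, so an attention readout approaches a hard selection of one value. This kind of limiting behavior is what interpolation-style arguments exploit.

```python
import numpy as np

def softmax(z):
    # Numerically stable 1-D softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

scores = np.array([0.1, 0.9, 0.5])
values = np.array([1.0, 5.0, 3.0])
# Scaling the scores by beta sharpens the softmax toward argmax,
# so the weighted readout tends to values[argmax(scores)] = 5.0.
for beta in (1.0, 10.0, 100.0):
    out = softmax(beta * scores) @ values
    print(beta, out)
```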
J. Hu
Center for Foundation Models and Generative AI, Northwestern University, Evanston, IL 60208, USA
Hude Liu
School of Mathematical Sciences, Fudan University, Shanghai 200433, China
Hong-Yu Chen
PhD, Department of Computer Science, Northwestern University
Machine Learning · Foundation Models · AI4Science
Weimin Wu
Ph.D. Candidate in Computer Science, Northwestern University
AI for Biology · ML Theory
Han Liu
Center for Foundation Models and Generative AI, Northwestern University, Evanston, IL 60208, USA; Department of Statistics and Data Science, Northwestern University, Evanston, IL 60208, USA