🤖 AI Summary
Transformer attention lacks a tractable learning-theoretic foundation. Method: We propose the attention-indexed model (AIM), a rigorous framework that characterizes token-level outputs as layered bilinear interactions over high-dimensional embeddings; it is the first analytically solvable model to support full-width key and query matrices, aligning closely with practical architectures. Combining statistical mechanics, random matrix theory, and approximate message passing, the analysis yields a complete learning-theoretic characterization of full-width attention: a closed-form expression for the Bayes-optimal generalization error and a sharp phase transition governed jointly by sample size, model width, and sequence length. Contribution/Results: Theory and experiment agree closely, confirming that gradient descent reaches Bayes-optimal performance. AIM thus provides a solvable and architecturally realistic learning theory for attention mechanisms.
📝 Abstract
We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in modern attention architectures.
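The kind of bilinear token-level interaction the abstract describes can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's exact model: the dimensions, the Gaussian weights, and the softmax readout are assumptions here, chosen only to show how full-width key/query matrices enter the score `S[i, j] = x_i^T (W_Q W_K^T) x_j`.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 32  # embedding dimension (model width)
L = 8   # sequence length

# Token embeddings: one row per token, high-dimensional.
X = rng.standard_normal((L, d)) / np.sqrt(d)

# Full-width (d x d) key and query matrices, in contrast to the
# low-rank factors assumed by earlier tractable attention models.
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))

# Bilinear interaction: S[i, j] = x_i^T (W_Q W_K^T) x_j.
S = X @ W_Q @ W_K.T @ X.T

# Row-wise softmax turns scores into attention weights.
A = np.exp(S - S.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)

# Token-level outputs: attention-weighted mixtures of the embeddings.
Y = A @ X
print(Y.shape)  # (8, 32)
```

In this picture, the learning problem the abstract studies is recovering `W_Q W_K^T` (or matching its predictions) from sample pairs `(X, Y)`, with the phase transitions expressed in terms of how many such samples are available relative to `d` and `L`.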