Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions

📅 2025-09-29
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work investigates the inductive bias and spectral properties of single-head attention layers trained by empirical risk minimization on high-dimensional sequence tasks. Leveraging tools from random matrix theory, approximate message passing, and spin-glass theory, it characterizes the implicit regularization induced by weight decay, showing how weight decay biases the model toward low-rank query and key matrices and offering a theoretical account of the spectral distributions of attention weights observed empirically in large-scale Transformers. By comparing the standard factorized parameterization, which encodes this inductive bias, with a direct element-wise parameterization of the query-key product, the analysis yields sharp asymptotic predictions for training and test errors and locates interpolation and recovery phase transitions. The predicted spectral distributions align with empirical measurements from large models, providing a verifiable asymptotic framework for understanding Transformer generalization.

📝 Abstract
We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks, given by the recently introduced attention-indexed model. Using tools from random matrix theory, spin-glass physics, and approximate message passing, we derive sharp asymptotics for training and test errors, locate interpolation and recovery thresholds, and characterize the limiting spectral distribution of the learned weights. Weight decay induces an implicit nuclear-norm regularization, favoring low-rank query and key matrices. Leveraging this, we compare the standard factorized training of query and key matrices with a direct parameterization in which their product is trained element-wise, revealing the inductive bias introduced by the factorized form. Remarkably, the predicted spectral distribution echoes empirical trends reported in large-scale transformers, offering a theoretical perspective consistent with these phenomena.
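
As a concrete companion to the abstract, the sketch below builds the combined score matrix S = Wq Wk^T in the two ways the paper contrasts: through low-rank factors Wq and Wk (the standard factorized training) and as a single matrix trained element-wise (the direct parameterization). This is a minimal illustration with assumed dimensions and a softmax-free bilinear score; it does not reproduce the attention-indexed model itself.

```python
# A minimal sketch of the two parameterizations compared in the paper.
# Dimensions, the initialization scale, and the softmax-free score
# function are illustrative assumptions, not the paper's exact setup.
import numpy as np

rng = np.random.default_rng(0)
d, r, L = 64, 4, 16          # embedding dim, latent rank, sequence length

# Factorized parameterization: S = Wq @ Wk.T, trained through the factors.
Wq = rng.standard_normal((d, r)) / np.sqrt(d)
Wk = rng.standard_normal((d, r)) / np.sqrt(d)
S_factorized = Wq @ Wk.T     # rank <= r by construction

# Direct parameterization: S trained element-wise, full rank a priori.
S_direct = rng.standard_normal((d, d)) / np.sqrt(d)

def attention_scores(X, S):
    """Pre-softmax bilinear attention scores X S X^T for X in R^{L x d}."""
    return X @ S @ X.T

X = rng.standard_normal((L, d))
print(attention_scores(X, S_factorized).shape)  # (L, L)
```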
Problem

Research questions and friction points this paper is trying to address.

Analyzes the training dynamics and spectral properties of single-head attention in high dimensions
Characterizes the implicit low-rank bias induced by weight decay regularization (a numerical check of the underlying identity follows this list)
Compares factorized versus direct parameterization of attention weight matrices
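
The implicit low-rank bias rests on a classical variational identity: over all factorizations S = Wq Wk^T, the minimum of (||Wq||_F^2 + ||Wk||_F^2)/2 equals the nuclear norm ||S||_*, so weight decay on the factors acts as nuclear-norm regularization on their product. Below is a minimal numerical check of this identity, with illustrative matrix sizes not taken from the paper.

```python
# Numerical check of the variational identity behind the implicit bias:
# min over {Wq @ Wk.T = S} of (||Wq||_F^2 + ||Wk||_F^2) / 2 = ||S||_*.
# The matrix size here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(1)
S = rng.standard_normal((8, 8))

U, sigma, Vt = np.linalg.svd(S)
nuclear_norm = sigma.sum()

# The balanced factorization Wq = U sqrt(Sigma), Wk^T = sqrt(Sigma) V^T
# attains the minimum, so weight decay on (Wq, Wk) penalizes ||S||_*.
Wq = U * np.sqrt(sigma)               # scales the columns of U
WkT = np.sqrt(sigma)[:, None] * Vt    # scales the rows of V^T
cost = 0.5 * (np.linalg.norm(Wq) ** 2 + np.linalg.norm(WkT) ** 2)

print(np.allclose(Wq @ WkT, S))       # True: valid factorization
print(np.isclose(cost, nuclear_norm)) # True: cost equals ||S||_*
```

Because the balanced SVD factorization attains the minimum, gradient descent with weight decay on the factors drifts toward low nuclear norm, hence low-rank, products.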
Innovation

Methods, ideas, or system contributions that make the work stand out.

Weight decay induces implicit nuclear-norm regularization
Compares factorized training with a direct element-wise parameterization
Uses random matrix theory to characterize the limiting spectral distribution of the learned weights (see the sketch below)
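
To make the spectral claim concrete, the sketch below measures the quantity the theory characterizes: the empirical singular-value distribution of the combined matrix Wq Wk^T. Random Gaussian factors stand in for trained weights here, so the dimensions and numbers are purely illustrative.

```python
# Empirical singular-value spectrum of the combined attention matrix.
# Gaussian factors stand in for trained weights; dims are assumptions.
import numpy as np

rng = np.random.default_rng(2)
d, r = 512, 64
Wq = rng.standard_normal((d, r)) / np.sqrt(d)
Wk = rng.standard_normal((d, r)) / np.sqrt(d)

S = Wq @ Wk.T                         # rank <= r by construction
svals = np.linalg.svd(S, compute_uv=False)

# A bulk of r nonzero singular values plus a point mass at zero is the
# qualitative low-rank shape the theory predicts for trained attention.
print(f"nonzero singular values: {(svals > 1e-10).sum()} of {d}")
print(f"largest / smallest nonzero: {svals[0]:.3f} / {svals[r - 1]:.3f}")
```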