🤖 AI Summary
This work investigates the fundamental learnability limits of deep multi-layer self-attention networks on high-dimensional sequential data. It establishes a rigorous equivalence between attention architectures with tied, low-rank weights and sequence multi-index models, a sequential generalization of the classical multi-index model. In a high-dimensional regime where the number of samples grows proportionally with the dimension, the authors characterize both the Bayes-optimal generalization error and the performance achievable by the best-known polynomial-time algorithm for this setting. Leveraging high-dimensional asymptotics, Bayes-optimal inference, and approximate message passing (AMP), they derive the layerwise learning dynamics and identify sharp sample-complexity thresholds for surpassing random guessing. The theoretical predictions, in particular the sequential, layer-by-layer learning phenomenon, are also shown to emerge in a realistic training setup. The result is a sharp, testable framework for understanding the representational capacity and data efficiency of attention-based models.
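To fix ideas, here is a minimal sketch of the architecture class under study, as we read it from the summary: a composition of single-head self-attention layers whose query and key weights are tied to a single shared low-rank matrix. The exact parametrization below (identity value weights, the $1/\sqrt{D}$ scaling, single head) is an illustrative assumption, not the paper's precise definition.

```python
import numpy as np

def tied_lowrank_attention(X, W):
    """One self-attention layer with tied, low-rank weights.

    X : (L, D) sequence of L tokens in dimension D.
    W : (D, r) low-rank matrix (r << D) shared between queries and
        keys (the tied-weights assumption); values are left as the
        identity purely for illustration.
    """
    D = X.shape[1]
    QK = X @ W                                   # tied: Q = K = X W
    scores = QK @ QK.T / np.sqrt(D)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)            # row-wise softmax
    return A @ X

def deep_attention(X, weights):
    """Deep attention network: a composition of such layers."""
    for W in weights:
        X = tied_lowrank_attention(X, W)
    return X

# Toy usage: a 3-layer network on a random sequence.
rng = np.random.default_rng(0)
L_tokens, D, r = 8, 64, 2
Ws = [rng.standard_normal((D, r)) / np.sqrt(D) for _ in range(3)]
out = deep_attention(rng.standard_normal((L_tokens, D)), Ws)
print(out.shape)  # (8, 64)
```

Because each layer maps an $(L, D)$ sequence back to an $(L, D)$ sequence, the layers compose freely, and the only learnable parameters are the low-rank matrices, one per layer.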
📝 Abstract
In this manuscript, we study the learning of deep attention neural networks, defined as compositions of multiple self-attention layers with tied, low-rank weights. We first establish a mapping from such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayesian-optimal learning, in the limit of large dimension $D$ and commensurately large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance, as well as of the performance of the best-known polynomial-time algorithm for this setting, namely approximate message passing (AMP), and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.
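To make the mapping concrete, below is a minimal sketch of a sequence multi-index model in the sense suggested by the abstract: the label of an $L \times D$ input sequence depends on it only through its projection onto a few hidden directions. The link function, the dimensions, and the $1/\sqrt{D}$ scaling are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L_tokens, D, r = 1000, 8, 128, 2   # hypothetical sizes for illustration

# Hidden low-rank directions (the "planted" weights to be learned).
W_star = rng.standard_normal((D, r))

def link(Z):
    """Link function g: sees the sequence only through Z = X W* / sqrt(D).

    The choice of g here is an arbitrary placeholder, not the paper's.
    """
    return np.tanh(Z).sum()

X = rng.standard_normal((N, L_tokens, D))   # N sequences of L tokens
Z = X @ W_star / np.sqrt(D)                 # sufficient statistics, (N, L, r)
y = np.array([link(z) for z in Z])          # labels depend only on Z
```

Learning then reduces to inferring the $rD$ entries of `W_star` from $N$ sequence-label pairs, which is why the natural asymptotic regime takes $N$ and $D$ large together at a fixed ratio.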