🤖 AI Summary
Existing analyses of the implicit bias of gradient descent in self-attention mechanisms establish only local, asymptotic convergence of the key-query matrix $W_t$ in direction to the maximum-margin solution $W_{\text{mm}}$, in the setting of binary classification with a linear decoder. They lack global convergence guarantees, finite-time bounds, and any analysis of adaptive step-size rules.
Method: We propose normalized gradient descent combined with Polyak’s adaptive step-size rule, operating within a non-convex optimization framework.
Contribution/Results: We establish the first global convergence theory guaranteeing directional convergence to $W_{\text{mm}}$ in finite iterations. Moreover, we theoretically characterize the sparsification rate of attention maps, systematically linking the implicit bias of Transformers to classical statistical learning principles—thereby enhancing both training efficiency and model interpretability.
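The two step-size rules named in the method can be sketched on a generic separable logistic-regression objective, used here purely as a stand-in for the paper's attention training problem (the data, dimensions, iteration counts, and step-size constant below are illustrative assumptions, not the paper's setup):

```python
import numpy as np

# Synthetic linearly separable data (a stand-in for the attention objective;
# all sizes and constants here are illustrative assumptions).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)  # separable labels, so the infimum of the loss is 0

def loss(w):
    # Mean logistic loss: (1/n) sum log(1 + exp(-y_i <x_i, w>)).
    return np.mean(np.log1p(np.exp(-y * (X @ w))))

def grad(w):
    s = -y / (1.0 + np.exp(y * (X @ w)))
    return X.T @ s / len(y)

def normalized_gd(w, eta=0.5, steps=200):
    # Normalized GD: move along g / ||g||, so every step has length eta
    # regardless of how small the gradient becomes.
    for _ in range(steps):
        g = grad(w)
        n = np.linalg.norm(g)
        if n < 1e-12:
            break
        w = w - eta * g / n
    return w

def polyak_gd(w, f_star=0.0, steps=200):
    # Polyak step size: eta_t = (L(w_t) - L*) / ||g_t||^2, using L* = 0,
    # which is the infimum for separable logistic loss (an assumption here).
    for _ in range(steps):
        g = grad(w)
        n2 = g @ g
        if n2 < 1e-24:
            break
        w = w - (loss(w) - f_star) / n2 * g
    return w
```

Both rules take larger relative steps as the gradient vanishes near the (infimal, at-infinity) minimum, which is the mechanism behind the finite-time directional rates discussed above; vanilla GD with a fixed step slows down in exactly this regime.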
📝 Abstract
We study the fundamental optimization principles of self-attention, the defining mechanism of transformers, by analyzing the implicit bias of gradient-based optimizers in training a self-attention layer with a linear decoder in binary classification. Building on prior studies in linear logistic regression, recent findings demonstrate that the key-query matrix $W_t$ from gradient descent (GD) converges in direction towards $W_{\text{mm}}$, which maximizes the margin between optimal and non-optimal tokens across sequences. However, this convergence is local, dependent on initial conditions, only holds asymptotically as the number of iterations increases, and leaves questions about the potential benefits of adaptive step-size rules unaddressed. To bridge this gap, we first establish scenarios for which convergence is provably \emph{global}. We then analyze two adaptive step-size strategies: normalized GD and Polyak step-size, demonstrating \emph{finite-time} convergence rates for $W_t$ to $W_{\text{mm}}$, and quantifying the sparsification rate of the attention map. These findings not only show that these strategies can accelerate parameter convergence over standard GD in a non-convex setting but also deepen the understanding of the implicit bias in self-attention, linking it more closely to the phenomena observed in linear logistic regression despite its intricate non-convex nature.
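For concreteness, the max-margin solution referenced throughout is typically formalized in this line of work as a minimum-norm separation problem over token scores. A minimal sketch, with notation assumed rather than taken from the abstract ($x_{i,t}$ is token $t$ of sequence $i$, $\mathrm{opt}_i$ its optimal token, and $z_i$ the query token):

```latex
W_{\text{mm}} \;=\; \arg\min_{W}\ \|W\|_F
\quad \text{subject to} \quad
\bigl(x_{i,\mathrm{opt}_i} - x_{i,t}\bigr)^{\top} W z_i \;\ge\; 1
\quad \text{for all } i \text{ and all } t \neq \mathrm{opt}_i .
```

Directional convergence of $W_t$ to this solution drives the attention-map sparsification quantified in the paper: as $\|W_t\|$ grows along the $W_{\text{mm}}$ direction, the softmax concentrates on each sequence's optimal token.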