🤖 AI Summary
Problem: Prior analyses of Transformer convergence under gradient descent treat self-attention, feed-forward networks, and residual connections in isolation, failing to capture their structural interplay, in particular how residual connections mitigate the low-rank degeneracy induced by softmax in attention outputs.
Method: We conduct a unified theoretical analysis of single- and multi-layer residual Transformers, explicitly modeling the coupling among attention, feed-forward layers, and residual connections.
Contribution/Results: We provide the first theoretical proof that residual connections alleviate the ill-conditioning of attention output matrices by improving their singular value spectrum and reducing their condition number. Building on this, we establish the first linear convergence guarantee for full-architecture Transformers under gradient descent, showing that, under appropriate initialization, the convergence rate is governed by the minimum and maximum singular values of the attention output matrix. Empirical results confirm that residual connections substantially enhance training stability and accelerate convergence.
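The conditioning claim can be illustrated numerically. The following is a minimal sketch, not the paper's construction: it assumes a single attention head, Gaussian inputs, hypothetical sizes `n` and `d`, and a small query/key initialization scale (`qk_scale`) chosen so that the softmax weights are close to uniform, which is the regime where the low-rank effect is easiest to see. It compares the condition number of the attention output `A` with that of the residual output `X + A`.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention_output(X, W_q, W_k, W_v):
    """Single-head self-attention output: softmax(Q K^T / sqrt(d)) V."""
    d = W_q.shape[1]
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
    return softmax(scores) @ (X @ W_v)

rng = np.random.default_rng(0)
n, d = 128, 8        # hypothetical sizes: n tokens, embedding dimension d
qk_scale = 1e-2      # assumption: small query/key init, so softmax is near uniform

X = rng.standard_normal((n, d))
W_q = qk_scale * rng.standard_normal((d, d)) / np.sqrt(d)
W_k = qk_scale * rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

A = attention_output(X, W_q, W_k, W_v)

# 2-norm condition number = largest singular value / smallest singular value.
cond_plain = np.linalg.cond(A)      # attention output alone
cond_resid = np.linalg.cond(X + A)  # attention output with the residual (skip) path

print(f"condition number without residual: {cond_plain:.3e}")
print(f"condition number with residual:    {cond_resid:.3e}")
```

With near-uniform softmax weights, every row of `A` is close to the same average of the value vectors, so `A` is nearly rank one and its smallest singular value is tiny. Adding `X` back restores singular values on the scale of the input itself, which is the intuition behind the improved spectrum described above.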
📝 Abstract
Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components, such as self-attention mechanisms and feedforward networks, without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
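To see how the minimum and maximum singular values can control a linear rate, here is a rough sketch on a deliberately simplified proxy, not the Transformer analyzed in the paper. It freezes a feature matrix `H` (either the attention output alone or the attention output plus the residual) and runs gradient descent on a linear least-squares readout. For this proxy it is a standard fact that, with step size `1 / sigma_max**2`, the parameter error contracts by at most `1 - (sigma_min / sigma_max)**2` per step. All sizes, the `qk_scale` initialization, and the helper `gd_param_error` are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with a max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gd_param_error(H, W_star, steps=200):
    """Gradient descent on 0.5 * ||H W - H W_star||_F^2 starting from W = 0.

    Returns the relative parameter error ||W - W_star|| / ||W_star|| per step.
    """
    sigma = np.linalg.svd(H, compute_uv=False)  # singular values, descending
    eta = 1.0 / sigma[0] ** 2                   # step size 1 / sigma_max^2
    Y = H @ W_star                              # realizable targets
    W = np.zeros_like(W_star)
    errs = []
    for _ in range(steps):
        W = W - eta * H.T @ (H @ W - Y)         # gradient of the squared loss
        errs.append(np.linalg.norm(W - W_star) / np.linalg.norm(W_star))
    return np.array(errs)

rng = np.random.default_rng(0)
n, d, k = 128, 8, 4   # hypothetical sizes: tokens, embedding dim, output dim
qk_scale = 1e-2       # assumption: small query/key init (near-uniform softmax)

X = rng.standard_normal((n, d))
W_q = qk_scale * rng.standard_normal((d, d)) / np.sqrt(d)
W_k = qk_scale * rng.standard_normal((d, d)) / np.sqrt(d)
W_v = rng.standard_normal((d, d)) / np.sqrt(d)

scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(d)
A = softmax(scores) @ (X @ W_v)   # attention output, frozen for this proxy

W_star = rng.standard_normal((d, k))
err_plain = gd_param_error(A, W_star)      # features without residual
err_resid = gd_param_error(X + A, W_star)  # features with residual

print(f"relative error after 200 steps, no residual:   {err_plain[-1]:.3e}")
print(f"relative error after 200 steps, with residual: {err_resid[-1]:.3e}")
```

In this toy setting the ill-conditioned feature matrix makes the contraction factor almost 1, so the error barely moves, while the residual version has a contraction factor well below 1 and converges geometrically. The paper's result concerns joint training of all Transformer weights, so this sketch only conveys the role of the singular values, not the actual proof technique.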