🤖 AI Summary
This work addresses the instability of training Transformers without skip connections, where standard Softmax self-attention often leads to rank collapse and ill-conditioned Jacobians. To overcome this, the authors propose Orthogonal Self-Attention (OSA), which constrains the attention matrix to be orthogonal by mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. By exploiting a low-rank structure, the computational and memory costs of OSA scale only linearly with sequence length, and a tailored initialization scheme is proven to yield a well-conditioned Jacobian. As a result, the method significantly improves the trainability of Transformers lacking residual connections and normalization layers.
📝 Abstract
Softmax Self-Attention (SSA) is a key component of Transformer architectures. However, when utilised within skipless architectures, which aim to improve representation learning, recent work has highlighted the inherent instability of SSA, as it induces rank collapse and poorly conditioned Jacobians. In this work, we design a novel attention mechanism, Orthogonal Self-Attention (OSA), which bypasses these issues with SSA, allowing (non-causal) Transformers without skip connections and normalisation layers to be trained more easily. In particular, OSA parametrises the attention matrix to be orthogonal by mapping a skew-symmetric matrix, formed from query-key values, through the matrix exponential. We show that this can be practically implemented by exploiting the low-rank structure of our query-key values, so that both the computational complexity and the memory cost of OSA scale linearly with sequence length. Furthermore, we derive an initialisation scheme which we prove ensures that the Jacobian of OSA is well-conditioned.
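The key algebraic fact behind OSA is that the matrix exponential of a skew-symmetric matrix is always orthogonal: if S^T = -S, then (exp S)^T (exp S) = exp(-S) exp(S) = I. The sketch below illustrates this on a dense matrix, building the skew-symmetric matrix from low-rank "query" and "key" factors. Note this is an illustrative sketch only: the factor names (`Qf`, `Kf`), sizes, and the use of a dense `scipy.linalg.expm` (which costs O(n³)) are assumptions for demonstration; the paper's actual parametrisation and its linear-cost implementation exploiting the low-rank structure are not reproduced here.

```python
import numpy as np
from scipy.linalg import expm  # dense matrix exponential (Padé approximation)

rng = np.random.default_rng(0)
n, r = 8, 2  # hypothetical sequence length and low rank

# Low-rank "query" and "key" factors (stand-ins for learned projections).
Qf = rng.normal(size=(n, r))
Kf = rng.normal(size=(n, r))

# Skew-symmetric matrix from the low-rank query-key product:
# S = Qf Kf^T - Kf Qf^T satisfies S^T = -S by construction.
S = Qf @ Kf.T - Kf @ Qf.T
assert np.allclose(S.T, -S)

# The matrix exponential of a skew-symmetric matrix is orthogonal,
# since (exp S)^T (exp S) = exp(S^T) exp(S) = exp(-S) exp(S) = I.
A = expm(S)
print(np.allclose(A.T @ A, np.eye(n)))  # True
```

Because an orthogonal attention matrix has all singular values equal to one, applying it cannot shrink the rank of the token representations, which is consistent with the paper's claim that OSA avoids rank collapse.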