🤖 AI Summary
This work addresses model collapse in Transformer training without learning rate warmup, identifying its root cause as malignant entropy collapse induced by excessive spectral energy concentration in the product $W_q^{\top} W_k$ of the query and key projection matrices. We establish the first spectral-theoretic characterization of this phenomenon and, grounded in Weyl's inequality, propose an adaptive learning rate clipping strategy to dynamically suppress unidirectional spectral energy accumulation. Additionally, we introduce a weight update smoothing constraint to enhance optimization stability. The method is validated across ViT, Swin, and GPT architectures, enabling stable end-to-end training without warmup, and it significantly improves convergence robustness and training efficiency. Our approach provides an interpretable, theoretically grounded, and broadly applicable framework for warmup-free Transformer optimization, unifying mechanistic insight with practical efficacy.
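The role of Weyl's inequality can be sketched as follows (our paraphrase of the mechanism, not the paper's exact statement). For the top singular value $\sigma_1(\cdot)$, an update $W_t = W_{t-1} - \eta \,\nabla W_t$ satisfies

$$
\sigma_1(W_t) \;\le\; \sigma_1(W_{t-1}) + \eta \,\sigma_1(\nabla W_t),
$$

so bounding $\eta$ whenever $\sigma_1(\nabla W_t)$ is large relative to $\sigma_1(W_{t-1})$ caps the per-step growth of the leading singular value and, by extension, limits how quickly spectral energy can concentrate in a few directions.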
📝 Abstract
Scaling Transformers to a large scale without technical tricks such as learning rate warmup or a noticeably lower learning rate is an extremely challenging task, and one that is gaining increasing attention. In this paper, we provide a theoretical analysis of the Transformer training process and reveal the rationale behind the model-crash phenomenon, termed *spectral energy concentration* of $W_q^{\top} W_k$, which is the cause of a malignant entropy collapse, where $W_q$ and $W_k$ are the projection matrices for the query and the key in the Transformer, respectively. To remedy this problem, motivated by *Weyl's inequality*, we present a novel optimization strategy, i.e., making the weight updates in successive steps smooth: if the ratio $\frac{\sigma_{1}(\nabla W_t)}{\sigma_{1}(W_{t-1})}$ exceeds a threshold, we automatically bound the learning rate to a weighted multiple of $\frac{\sigma_{1}(W_{t-1})}{\sigma_{1}(\nabla W_t)}$, where $\nabla W_t$ is the update quantity at step $t$. Such an optimization strategy prevents spectral energy from concentrating in only a few directions, and thus avoids the malignant entropy collapse that triggers the model crash. We conduct extensive experiments using ViT, Swin Transformer, and GPT, showing that our optimization strategy can effectively and stably train these Transformers without learning rate warmup.
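The clipping rule described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the idea, not the paper's implementation; the function name `clipped_lr` and the hyperparameters `threshold` and `alpha` (the "weighted multiple") are our own assumptions.

```python
import torch

def clipped_lr(base_lr, W_prev, grad_W, threshold=1.0, alpha=1.0):
    """Sketch of the Weyl-inequality-motivated learning rate bound.

    If sigma_1(grad_W) / sigma_1(W_prev) exceeds `threshold`, the learning
    rate is capped at alpha * sigma_1(W_prev) / sigma_1(grad_W), keeping the
    spectral magnitude of successive weight updates smooth.
    """
    # Largest singular values (spectral norms) of the update and the weights.
    sigma_grad = torch.linalg.matrix_norm(grad_W, ord=2)
    sigma_W = torch.linalg.matrix_norm(W_prev, ord=2)

    ratio = sigma_grad / (sigma_W + 1e-12)  # guard against zero weights
    if ratio > threshold:
        return min(base_lr, alpha * (sigma_W / sigma_grad).item())
    return base_lr
```

In practice such a rule would be applied per weight matrix (e.g. to $W_q$ and $W_k$) inside the optimizer step, scaling each update before it is added to the parameters.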