🤖 AI Summary
This work addresses the trade-off between training stability and model performance in deep Transformers: PreNorm offers stable training but suffers from representation collapse that limits expressiveness, while PostNorm achieves superior performance yet is prone to gradient anomalies and training instability. To reconcile these issues, the authors propose SpanNorm, a novel normalization scheme that stabilizes signal propagation through a residual connection spanning the entire Transformer block while incorporating PostNorm-style output normalization. This design unifies the stability of PreNorm with the high performance of PostNorm, supported by theoretical analysis and a variance-aware scaling strategy. Experiments demonstrate that SpanNorm consistently outperforms existing normalization methods in both dense and Mixture-of-Experts (MoE) architectures, enabling more robust and effective training.
📝 Abstract
The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture ensures training stability at the cost of potential performance degradation in deep models, while the "PostNorm" architecture offers strong performance but suffers from severe training instability. In this work, we propose SpanNorm, a novel technique designed to resolve this dilemma by integrating the strengths of both paradigms. Structurally, SpanNorm establishes a clean residual connection that spans the entire Transformer block to stabilize signal propagation, while employing a PostNorm-style computation that normalizes the aggregated output to enhance model performance. We provide a theoretical analysis demonstrating that SpanNorm, combined with a principled scaling strategy, maintains bounded signal variance throughout the network, preventing the gradient issues that plague PostNorm models while also alleviating the representation collapse of PreNorm. Empirically, SpanNorm consistently outperforms standard normalization schemes in both dense and Mixture-of-Experts (MoE) settings, paving the way for more powerful and stable Transformer architectures.
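To make the structural contrast concrete, here is a minimal numpy sketch of the block layouts the abstract describes. Everything below is an illustrative assumption, not the paper's implementation: `rms_norm` stands in for whatever norm layer the authors use, `toy_sublayers` replaces real self-attention and FFN sublayers, and `alpha` is a placeholder for the variance-aware scaling factor whose exact form the abstract does not specify. The key structural point is that the residual in `spannorm_block` skips the entire block and the normalization is applied to the aggregated output, PostNorm-style.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Normalize over the feature axis (stand-in for the paper's norm layer).
    return x / np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)

def toy_sublayers(x, W1, W2):
    # Toy stand-in for a block's attention + FFN computation.
    return np.tanh(x @ W1) @ W2

def prenorm_block(x, W1, W2):
    # PreNorm: normalize the input to the sublayers; residual bypasses the norm.
    return x + toy_sublayers(rms_norm(x), W1, W2)

def postnorm_block(x, W1, W2):
    # PostNorm: normalize after the residual addition (norm lies on the
    # residual path, which is the source of its training instability).
    return rms_norm(x + toy_sublayers(x, W1, W2))

def spannorm_block(x, W1, W2, alpha=1.0):
    # SpanNorm (as described at a high level in the abstract): a clean
    # residual spans the ENTIRE block, and the aggregated output is
    # normalized PostNorm-style. `alpha` is a hypothetical placeholder
    # for the paper's variance-aware scaling.
    return rms_norm(x + alpha * toy_sublayers(x, W1, W2))

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 16))          # (batch, sequence, features)
W1 = rng.normal(size=(16, 16)) * 0.1
W2 = rng.normal(size=(16, 16)) * 0.1
y = spannorm_block(x, W1, W2)
print(y.shape)  # (2, 8, 16)
```

Note how the output normalization keeps the per-token RMS of `y` near 1, which is the bounded-variance property the theoretical analysis aims to guarantee across depth.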