🤖 AI Summary
Pre-LayerNorm Transformers (e.g., LLaMA, Qwen, DeepSeek) offer stable, scalable pretraining for large language models but suffer from exponential growth of activation variance across layers, which causes the residual path to dominate sublayer outputs and hinders learning in deeper layers. To address this, the authors propose Gradient-Preserving Activation Scaling (GPAS): a layer-adaptive scaling of intermediate activations in the forward pass that leaves backward gradients unchanged, requiring no changes to the network architecture or optimizer. GPAS is compatible with Pre-LN, Sandwich-LN, and DeepNorm configurations. Evaluated on language model pretraining at scales from 71M to 1B parameters, GPAS accelerates convergence by 18% on average and consistently improves downstream task performance by +0.8% on average, effectively mitigating the degradation of deep-layer training.
📝 Abstract
Modern large language models, such as the LLaMA, Qwen, and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth of activation variance across layers, which causes the residual path to dominate sub-layer outputs and limits the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves the information in the activations intact and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across model sizes from 71M to 1B parameters show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings.
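The core idea of "scaling activations down while keeping their gradients unchanged" can be sketched with explicit forward and backward functions. This is a minimal illustration, not the paper's exact formulation: the per-layer scale here is a fixed constant, whereas GPAS adapts it per layer. In an autograd framework the same effect is typically obtained with a stop-gradient (detach) trick, y = x + sg((s - 1) * x), whose forward value is s * x but whose gradient with respect to x is the identity.

```python
import numpy as np

def gpas_forward(x, scale):
    # Forward pass: scale down the activation to curb variance growth.
    # `scale` is a per-layer value in (0, 1]; a fixed constant here for
    # illustration (the layer-adaptive version is an assumption of the paper).
    return scale * x

def gpas_backward(grad_out, scale):
    # Backward pass: gradient-preserving, so `scale` is ignored and the
    # incoming gradient passes through unchanged. Naively backpropagating
    # through `scale * x` would multiply gradients by `scale` at every
    # layer, compounding into vanishing gradients in deep stacks.
    # Autograd equivalent: y = x + stop_gradient((scale - 1) * x).
    return grad_out

x = np.array([2.0, -4.0, 8.0])
g = np.array([1.0, 1.0, 1.0])
y = gpas_forward(x, 0.5)    # activations are halved
dx = gpas_backward(g, 0.5)  # gradient is NOT halved
```

With a naive (non-gradient-preserving) downscale, `dx` would be `0.5 * g` per layer; across L layers the gradient would shrink by `0.5**L`, which is exactly the failure mode the stop-gradient formulation avoids.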