GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

📅 2025-06-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Pre-LayerNorm Transformers (e.g., LLaMA, Qwen, DeepSeek) offer stable and scalable pretraining for large language models, but activation variance grows exponentially across their layers, causing the residual path to dominate sublayer outputs and hindering learning in deeper layers. To address this, we propose Gradient-Preserving Activation Scaling (GPAS): a layer-adaptive scaling of intermediate activations during the forward pass that leaves backward gradients unchanged, requiring no change to the network architecture or optimizer. GPAS is compatible with Pre-LN, Sandwich-LN, and DeepNorm configurations. Evaluated on language model pretraining across 71M–1B parameter scales, GPAS accelerates convergence by 18% on average, consistently improves downstream task performance by +0.8% on average, and effectively mitigates the degradation of deep-layer training.
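To make the mechanism concrete, below is a minimal PyTorch sketch of gradient-preserving scaling: the forward pass multiplies activations by a per-layer gate in (0, 1), while the backward pass hands the incoming gradient to the activation unscaled. The class names, the sigmoid gate, and the way the gate itself is trained are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of gradient-preserving activation scaling (PyTorch).
# Names and the gate-training path are assumptions for illustration only.
import torch
import torch.nn as nn


class _ScaleForwardOnly(torch.autograd.Function):
    """Forward: y = s * x.  Backward: pass dL/dy to x without the factor s."""

    @staticmethod
    def forward(ctx, x, s):
        ctx.save_for_backward(x)
        return s * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad_x = grad_out              # gradient w.r.t. x is left unscaled
        grad_s = (grad_out * x).sum()  # the scalar gate still receives its usual gradient
        return grad_x, grad_s


class GPASGate(nn.Module):
    """One learnable gate per layer; the sigmoid keeps the forward scale in (0, 1)."""

    def __init__(self, init_logit: float = 0.0):
        super().__init__()
        self.logit = nn.Parameter(torch.tensor(init_logit))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return _ScaleForwardOnly.apply(x, torch.sigmoid(self.logit))
```

In use, one such gate would sit at each layer's output (e.g., `x = gate(x)` at the end of a block), so shallow and deep layers can learn different amounts of forward damping while backpropagation is untouched.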

📝 Abstract
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the residual path to dominate over sub-layer outputs and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings.
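The variance growth described in the abstract is easy to observe empirically. The toy probe below stacks randomly initialized Pre-LN blocks and prints the variance of the residual stream after each one; the dimensions and module layout are hypothetical stand-ins, not the paper's experimental setup.

```python
# Hypothetical probe: watch activation variance accumulate across Pre-LN blocks.
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual + attention sublayer
        x = x + self.mlp(self.ln2(x))                       # residual + MLP sublayer
        return x


torch.manual_seed(0)
blocks = nn.ModuleList([PreLNBlock() for _ in range(24)])
x = torch.randn(2, 16, 64)  # (batch, sequence length, d_model)
with torch.no_grad():
    for i, block in enumerate(blocks):
        x = block(x)
        # The residual stream is never renormalized in Pre-LN, so its variance keeps
        # accumulating with depth, and the (normalized) sublayer outputs become
        # relatively smaller contributions.
        print(f"layer {i:2d}: activation variance = {x.var().item():.2f}")
```

GPAS targets exactly this effect: it damps the forward signal layer by layer so deeper sublayers retain influence, without damping the gradients that train them.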
Problem

Research questions and friction points this paper is trying to address.

Pre-LN suffers exponential activation variance growth
Residual path dominates sub-layer outputs
Limits learning capacity of deeper layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales activations while preserving gradients
Enhances Pre-LN Transformer performance
Applicable to diverse architectures
👥 Authors
Tianhao Chen (PhD student, Zhejiang University; Geotechnical engineering)
Xin Xu (The Hong Kong University of Science and Technology)
Zijing Liu (International Digital Economy Academy)
Pengxiang Li (Beijing Institute of Technology; Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning)
Xinyuan Song (Emory University)
Ajay Kumar Jaiswal (University of Texas at Austin)
Fan Zhang (The Hong Kong University of Science and Technology)
Jishan Hu (The Hong Kong University of Science and Technology)
Yang Wang (The Hong Kong University of Science and Technology)
Hao Chen (The Hong Kong University of Science and Technology)
Shizhe Diao (NVIDIA Research; Large Language Models, Natural Language Processing)
Shiwei Liu (University of Oxford)
Yu Li (International Digital Economy Academy)
Yin Lu (University of Surrey)
Can Yang (Hong Kong University of Science and Technology; Statistical Machine Learning, Statistical Genetics and Genomics)