GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling

📅 2025-06-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Pre-LayerNorm Transformers (e.g., LLaMA, Qwen, DeepSeek) offer stable and scalable pretraining for large language models, but activation variance grows exponentially across their layers, causing the residual path to dominate sublayer outputs and hindering learning in deeper layers. To address this, we propose Gradient-Preserving Activation Scaling (GPAS): a layer-adaptive scaling of intermediate activations during the forward pass that leaves backward gradients unchanged, requiring no change to the network architecture or optimizer. GPAS is compatible with Pre-LN, Sandwich-LN, and DeepNorm configurations. Evaluated on language model pretraining across 71M–1B parameter scales, GPAS accelerates convergence by 18% on average, consistently improves downstream task performance by +0.8% on average, and effectively mitigates the degradation of deep-layer training.
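To make the mechanism concrete, below is a minimal PyTorch sketch of gradient-preserving scaling: the forward pass multiplies activations by a per-layer gate in (0, 1), while the backward pass hands the incoming gradient to the activation unscaled. The class names, the sigmoid gate, and the way the gate itself is trained are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of gradient-preserving activation scaling (PyTorch).
# Names and the gate-training path are assumptions for illustration only.
import torch
import torch.nn as nn


class _ScaleForwardOnly(torch.autograd.Function):
    """Forward: y = s * x.  Backward: pass dL/dy to x without the factor s."""

    @staticmethod
    def forward(ctx, x, s):
        ctx.save_for_backward(x)
        return s * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad_x = grad_out              # gradient w.r.t. x is left unscaled
        grad_s = (grad_out * x).sum()  # the scalar gate still receives its usual gradient
        return grad_x, grad_s


class GPASGate(nn.Module):
    """One learnable gate per layer; the sigmoid keeps the forward scale in (0, 1)."""

    def __init__(self, init_logit: float = 0.0):
        super().__init__()
        self.logit = nn.Parameter(torch.tensor(init_logit))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return _ScaleForwardOnly.apply(x, torch.sigmoid(self.logit))
```

In use, one such gate would sit at each layer's output (e.g., `x = gate(x)` at the end of a block), so shallow and deep layers can learn different amounts of forward damping while backpropagation is untouched.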

📝 Abstract
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the residual path to dominate over sub-layer outputs and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings.
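The variance growth described in the abstract is easy to observe empirically. The toy probe below stacks randomly initialized Pre-LN blocks and prints the variance of the residual stream after each one; the dimensions and module layout are hypothetical stand-ins, not the paper's experimental setup.

```python
# Hypothetical probe: watch activation variance accumulate across Pre-LN blocks.
import torch
import torch.nn as nn


class PreLNBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual + attention sublayer
        x = x + self.mlp(self.ln2(x))                       # residual + MLP sublayer
        return x


torch.manual_seed(0)
blocks = nn.ModuleList([PreLNBlock() for _ in range(24)])
x = torch.randn(2, 16, 64)  # (batch, sequence length, d_model)
with torch.no_grad():
    for i, block in enumerate(blocks):
        x = block(x)
        # The residual stream is never renormalized in Pre-LN, so its variance keeps
        # accumulating with depth, and the (normalized) sublayer outputs become
        # relatively smaller contributions.
        print(f"layer {i:2d}: activation variance = {x.var().item():.2f}")
```

GPAS targets exactly this effect: it damps the forward signal layer by layer so deeper sublayers retain influence, without damping the gradients that train them.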
Problem

Research questions and friction points this paper is trying to address.

Pre-LN suffers exponential activation variance growth
Residual path dominates sub-layer outputs
Limits learning capacity of deeper layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scales activations while preserving gradients
Enhances Pre-LN Transformer performance
Applicable to diverse architectures
👥 Authors
Tianhao Chen (PhD student, Zhejiang University; Geotechnical engineering)
Xin Xu (The Hong Kong University of Science and Technology)
Zijing Liu (International Digital Economy Academy)
Pengxiang Li (Beijing Institute of Technology; Multimodal Agent, Vision and Language, 3DV, Hyperbolic Learning)
Xinyuan Song (Emory University)
Ajay Kumar Jaiswal (University of Texas at Austin)
Fan Zhang (The Hong Kong University of Science and Technology)
Jishan Hu (The Hong Kong University of Science and Technology)
Yang Wang (The Hong Kong University of Science and Technology)
Hao Chen (The Hong Kong University of Science and Technology)
Shizhe Diao (NVIDIA Research; Large Language Models, Natural Language Processing)
Shiwei Liu (University of Oxford)
Yu Li (International Digital Economy Academy)
Yin Lu (University of Surrey)
Can Yang (Hong Kong University of Science and Technology; Statistical Machine Learning, Statistical Genetics and Genomics)