🤖 AI Summary
This work proposes Progressive Residual Warmup (ProRes) to address the challenges of training instability and slow convergence in large language model pretraining. ProRes progressively activates residual connections layer by layer during early training, implementing a "shallow-first" learning mechanism that enables gradual, layer-wise residual scaling. This approach steers optimization along a more favorable trajectory, enhancing convergence behavior. Built upon the Transformer architecture, ProRes integrates a hierarchical residual scaling strategy with tailored normalization and initialization schemes. Experiments across multiple model scales demonstrate that ProRes significantly improves training stability, accelerates convergence, and boosts both generalization capability and downstream task performance.
📄 Abstract
Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency among sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers warming up over more steps. In this way, deeper layers wait for earlier layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also induces a distinct optimization trajectory, leading to faster convergence, stronger generalization, and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
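The per-layer warmup described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: it assumes a linear ramp from 0 to 1 and assumes each layer's warmup length grows with its depth (the `base_steps` parameter and the `(layer_idx + 1) * base_steps` schedule are illustrative choices, not taken from the paper).

```python
def residual_scale(layer_idx: int, step: int, base_steps: int = 1000) -> float:
    """Residual scaling factor for layer `layer_idx` at training `step`.

    Assumed schedule: layer l ramps linearly from 0 to 1 over
    (l + 1) * base_steps steps, so deeper layers contribute later,
    matching the "early layer learns first" idea.
    """
    warmup_steps = (layer_idx + 1) * base_steps  # deeper layers warm up longer
    return min(1.0, step / warmup_steps)

# Inside a Transformer block, the scalar would gate the residual branch,
# e.g.:  h = x + residual_scale(layer_idx, step) * sublayer(x)
# so a layer with scale 0 passes its input through unchanged.
```

At any given step, shallower layers have a scale at least as large as deeper ones, so early layers reach full contribution first while deeper layers fade in gradually.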