🤖 AI Summary
Conventional Transformer training employs a uniform learning rate across all model components, ignoring their heterogeneous dynamics in parameter sensitivity and update magnitude, which leads to suboptimal optimization efficiency.
Method: We propose a dynamic, decoupled learning rate scheduling framework that introduces, for the first time, the concept of *relative learning rates*: per-component learning rates scaled adaptively according to layer-specific gradient statistics and architectural roles. The approach is architecture-agnostic within the Transformer family and integrates seamlessly with Mixture of Experts (MoE) configurations.
Contribution/Results: The method enables direct hyperparameter transfer across model scales, from small baselines to models 27× larger, without manual retuning. Empirical evaluation demonstrates up to 23% faster convergence on complex models and substantial reductions in computational resource consumption. This work establishes a scalable, efficient, and broadly generalizable optimization paradigm for large-scale neural networks.
📝 Abstract
In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our relative learning rate schedules (RLRS) method accelerates the training process by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be tuned efficiently on smaller models and then reused effectively on models up to $27\times$ larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
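To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of how relative learning rates could be expressed: each Transformer component gets a multiplier on a single base learning rate, so only the multipliers need tuning on a small model, and they can be reused unchanged when scaling up. The component names and multiplier values below are purely hypothetical assumptions for illustration.

```python
# Hypothetical relative multipliers per Transformer component.
# In the RLRS setting, values like these would be tuned once on a
# small model and then transferred to much larger models as-is.
RELATIVE_LR = {
    "embedding": 1.0,
    "attention": 0.5,
    "ffn": 0.8,
    "moe_router": 0.1,   # assumed to be more sensitive (illustrative)
    "output_head": 0.3,
}

def effective_lrs(base_lr, relative=RELATIVE_LR):
    """Map each component to base_lr scaled by its relative rate."""
    return {name: base_lr * r for name, r in relative.items()}

# Only the single base learning rate changes with model scale;
# the relative multipliers (the tuned hyperparameters) are reused.
small = effective_lrs(base_lr=3e-4)
large = effective_lrs(base_lr=1e-4)
```

In a real training setup these per-component rates would typically be wired into the optimizer's parameter groups (e.g. PyTorch's `param_groups`), but the key property shown here is that the ratios between component learning rates are preserved across scales.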