🤖 AI Summary
Efficient selection of dynamic hyperparameters, such as the learning rate, remains challenging in large language model (LLM) training. Method: This paper proposes Opt-Laws, a novel framework grounded in stochastic differential equations (SDEs) that establishes an interpretable mathematical relationship between hyperparameter dynamics and training loss. Opt-Laws enables *a priori* prediction of optimal learning-rate schedules across pre-training, continual training, and fine-tuning. It integrates SDE-based modeling, hyperparameter–loss function fitting, and multi-scale empirical validation. Contribution/Results: Evaluated across diverse model scales and dataset sizes, Opt-Laws predicts training loss with high accuracy. Experiments demonstrate that it substantially reduces hyperparameter search overhead, shortens tuning cycles severalfold, and improves final model performance.
📝 Abstract
Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, which evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potentially optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduces novel mathematical interpretability and offers a robust theoretical foundation for several popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.
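To make the workflow concrete, here is a minimal toy sketch of the *idea* of pre-selecting an LR schedule via a fitted hyperparameter-loss surrogate. This is not the paper's actual Opt-Laws formula: the feature choices (cumulative LR as a drift-like term, cumulative squared LR as a noise-like term, final LR) and the linear surrogate are hypothetical stand-ins, loosely motivated by the SDE framing, and the "pilot run" losses are made-up numbers.

```python
import numpy as np

def schedule_features(lrs):
    # Hypothetical summary features of an LR schedule, loosely inspired by
    # the SDE view: cumulative LR (drift-like), cumulative squared LR
    # (noise-like), and the final LR value.
    lrs = np.asarray(lrs, dtype=float)
    return np.array([lrs.sum(), np.square(lrs).sum(), lrs[-1]])

def fit_surrogate(schedules, final_losses):
    # Least-squares fit of a linear surrogate: loss ~ w . features + b.
    X = np.array([schedule_features(s) for s in schedules])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, np.asarray(final_losses, dtype=float),
                            rcond=None)
    return w

def predict_loss(w, schedule):
    f = np.append(schedule_features(schedule), 1.0)
    return float(f @ w)

# Toy candidate schedules over 100 steps.
steps = 100
cosine = 1e-3 * 0.5 * (1 + np.cos(np.linspace(0, np.pi, steps)))
linear = np.linspace(1e-3, 0.0, steps)
const = np.full(steps, 1e-3)

# Pretend these final losses came from small pilot runs (made-up values).
w = fit_surrogate([cosine, linear, const], [2.10, 2.12, 2.30])

# A priori selection: rank candidates by predicted loss, no full runs needed.
best = min([("cosine", cosine), ("linear", linear), ("const", const)],
           key=lambda kv: predict_loss(w, kv[1]))
print(best[0])
```

The selection step is the point: once a surrogate is fit on cheap small-scale runs, arbitrarily many candidate schedules can be ranked by predicted loss before committing compute to a full training run.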