🤖 AI Summary
Efficient selection of dynamic hyperparameters, such as the learning rate, remains challenging in large language model (LLM) training. Method: This paper proposes Opt-Laws, a novel framework grounded in stochastic differential equations (SDEs) that establishes an interpretable mathematical relationship between hyperparameter dynamics and training loss. Opt-Laws enables *a priori* prediction of optimal learning-rate schedules across pre-training, continual training, and fine-tuning. It integrates SDE-based modeling, hyperparameter–loss function fitting, and multi-scale empirical validation. Contribution/Results: Evaluated across diverse model scales and dataset sizes, Opt-Laws predicts training loss with high accuracy. Experiments demonstrate that it substantially reduces hyperparameter search overhead, shortens tuning cycles severalfold, and improves final model performance.
📝 Abstract
Large Language Models have driven significant AI advancements, yet their training is resource-intensive and highly sensitive to hyper-parameter selection. While scaling laws provide valuable guidance on model size and data requirements, they fall short in choosing dynamic hyper-parameters, such as learning-rate (LR) schedules, which evolve during training. To bridge this gap, we present Optimization Hyper-parameter Laws (Opt-Laws), a framework that effectively captures the relationship between hyper-parameters and training outcomes, enabling the pre-selection of potentially optimal schedules. Grounded in stochastic differential equations, Opt-Laws introduces novel mathematical interpretability and offers a robust theoretical foundation for several popular LR schedules. Our extensive validation across diverse model sizes and data scales demonstrates Opt-Laws' ability to accurately predict training loss and identify optimal LR schedule candidates in pre-training, continual training, and fine-tuning scenarios. This approach significantly reduces computational costs while enhancing overall model performance.
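To make the workflow concrete, here is a minimal toy sketch of the *idea* of pre-selecting an LR schedule via a fitted hyperparameter-loss surrogate. This is not the paper's actual Opt-Laws formula: the feature choices (cumulative LR as a drift-like term, cumulative squared LR as a noise-like term, final LR) and the linear surrogate are hypothetical stand-ins, loosely motivated by the SDE framing, and the "pilot run" losses are made-up numbers.

```python
import numpy as np

def schedule_features(lrs):
    # Hypothetical summary features of an LR schedule, loosely inspired by
    # the SDE view: cumulative LR (drift-like), cumulative squared LR
    # (noise-like), and the final LR value.
    lrs = np.asarray(lrs, dtype=float)
    return np.array([lrs.sum(), np.square(lrs).sum(), lrs[-1]])

def fit_surrogate(schedules, final_losses):
    # Least-squares fit of a linear surrogate: loss ~ w . features + b.
    X = np.array([schedule_features(s) for s in schedules])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    w, *_ = np.linalg.lstsq(X, np.asarray(final_losses, dtype=float),
                            rcond=None)
    return w

def predict_loss(w, schedule):
    f = np.append(schedule_features(schedule), 1.0)
    return float(f @ w)

# Toy candidate schedules over 100 steps.
steps = 100
cosine = 1e-3 * 0.5 * (1 + np.cos(np.linspace(0, np.pi, steps)))
linear = np.linspace(1e-3, 0.0, steps)
const = np.full(steps, 1e-3)

# Pretend these final losses came from small pilot runs (made-up values).
w = fit_surrogate([cosine, linear, const], [2.10, 2.12, 2.30])

# A priori selection: rank candidates by predicted loss, no full runs needed.
best = min([("cosine", cosine), ("linear", linear), ("const", const)],
           key=lambda kv: predict_loss(w, kv[1]))
print(best[0])
```

The selection step is the point: once a surrogate is fit on cheap small-scale runs, arbitrarily many candidate schedules can be ranked by predicted loss before committing compute to a full training run.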