Scaling and Transferability of Annealing Strategies in Large Language Model Training

📅 2025-12-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Learning rate annealing strategies in large language model (LLM) training scale poorly and transfer only to a limited degree across models. Method: We propose a general predictive framework based on a Warmup-Steady-Decay (WSD) three-phase schedule. By systematically analyzing the training dynamics of Dense and Mixture-of-Experts (MoE) architectures across multiple scales, we empirically establish the cross-model transferability of annealing policies for the first time and develop a generalized predictive model that integrates training steps, peak learning rate, and annealing behavior. Contribution/Results: Our method infers optimal annealing ratios for large models from small-scale experiments, without exhaustive hyperparameter search. Validated on 10-billion-parameter models, it achieves high-fidelity transfer, improving convergence speed by 12–18% and significantly enhancing training stability.

📝 Abstract

Learning rate scheduling is crucial for training large language models, yet understanding optimal annealing strategies across different model configurations remains challenging. In this work, we investigate the transferability of annealing dynamics in large language model training and refine a generalized predictive framework for optimizing annealing strategies under the Warmup-Steady-Decay (WSD) scheduler. Our improved framework incorporates training steps, maximum learning rate, and annealing behavior, enabling more efficient optimization of learning rate schedules. Our work provides practical guidance for selecting optimal annealing strategies without exhaustive hyperparameter searches, demonstrating that smaller models can serve as reliable proxies for optimizing the training dynamics of larger models. We validate our findings through extensive experiments on both Dense and Mixture-of-Experts (MoE) models, demonstrating that optimal annealing ratios follow consistent patterns and can be transferred across different training configurations.
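
To make the three phases of the WSD scheduler concrete, here is a minimal Python sketch. It is not the paper's implementation. The function name `wsd_learning_rate`, its parameters, the linear warmup, and the linear decay shape are all illustrative assumptions; the abstract does not specify the decay curve. The annealing ratio is taken here to mean the fraction of total training steps spent in the decay phase.

```python
def wsd_learning_rate(step, total_steps, peak_lr, warmup_steps,
                      anneal_ratio, min_lr=0.0):
    """Return the learning rate at `step` under a Warmup-Steady-Decay schedule.

    Assumptions (not from the paper): linear warmup, linear decay, and
    `anneal_ratio` defined as the fraction of `total_steps` spent decaying.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    if not 0.0 <= anneal_ratio <= 1.0:
        raise ValueError("anneal_ratio must lie in [0, 1]")

    decay_steps = int(round(anneal_ratio * total_steps))
    decay_start = total_steps - decay_steps
    if warmup_steps > decay_start:
        raise ValueError("warmup and decay phases overlap")

    # Phase 1: warmup, ramping linearly up to the peak learning rate.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Phase 2: steady, holding the peak learning rate.
    if step < decay_start or decay_steps == 0:
        return peak_lr

    # Phase 3: decay (annealing), moving linearly from peak_lr to min_lr.
    progress = (step - decay_start) / decay_steps
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Hypothetical configuration: 1000 steps, 100 warmup steps, 20% annealing.
    for s in (0, 99, 500, 900, 1000):
        print(s, wsd_learning_rate(s, 1000, 3e-4, 100, 0.2))
```

Under this parameterization, a search over annealing strategies reduces to choosing `anneal_ratio` (and possibly the decay shape) for a given `total_steps` and `peak_lr`, which are the quantities the predictive framework is described as incorporating.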

Problem

Research questions and friction points this paper is trying to address.

Investigates the transferability of annealing strategies in large language model training

Refines a predictive framework to optimize learning rate schedules efficiently

Provides guidance for selecting optimal annealing strategies without exhaustive searches

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized predictive framework for annealing optimization

Incorporates training steps, maximum learning rate, and annealing behavior

Smaller models as proxies for larger model training dynamics

🔎 Similar Papers

Large Vocabulary Size Improves Large Language Models