Resolving Discrepancies in Compute-Optimal Scaling of Language Models

📅 2024-06-27
🏛️ arXiv.org
📈 Citations: 11
Influential: 1
🤖 AI Summary
Significant disagreement exists between the Kaplan and Chinchilla scaling laws regarding optimal compute allocation for language models. Method: We systematically reproduce both scaling laws on OpenWebText2 and RefinedWeb, identifying and quantifying three major practical deviations: final-layer computational overhead, learning-rate warmup duration, and scale-dependent optimizer hyperparameters—particularly AdamW’s β₂. Contribution/Results: Counter to a hypothesis of Hoffmann et al., we find that careful learning-rate decay is not essential for the validity of the Chinchilla scaling law; we additionally derive a joint scaling law linking learning rate, batch size, and β₂. After correcting for the three deviations, predictions from the two scaling laws converge closely. We establish β₂ as a critical tunable parameter for low-batch-size training and provide actionable guidelines for efficient training of models ranging from 10M to 10B parameters. Our framework enables superior performance–cost trade-offs under fixed compute budgets.

📝 Abstract
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
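To make the $\beta_2$ point concrete: in AdamW, $\beta_2$ controls the exponential moving average of squared gradients, so its effective averaging horizon is roughly $1/(1-\beta_2)$ steps. The sketch below is a minimal single-parameter AdamW update in pure Python, illustrating where $\beta_2$ enters; the specific hyperparameter values are illustrative assumptions, not the paper's prescribed settings.

```python
# Minimal single-parameter AdamW step (pure Python). beta2 governs the
# second-moment estimate; its averaging horizon is ~1/(1 - beta2) steps,
# which is why lowering beta2 shortens the horizon at small batch sizes.
# All hyperparameter values here are illustrative, not from the paper.

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    m = beta1 * m + (1 - beta1) * grad       # first moment (EMA of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment (EMA of squares)
    m_hat = m / (1 - beta1 ** t)             # bias corrections for step t
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay, per AdamW: decay applied outside the Adam ratio.
    theta = theta - lr * (m_hat / (v_hat ** 0.5 + eps) + weight_decay * theta)
    return theta, m, v

# Effective averaging horizon for two common beta2 choices.
for beta2 in (0.999, 0.95):
    print(beta2, "->", round(1 / (1 - beta2)), "steps")
```

Shortening this horizon (e.g. 0.999 → 0.95, i.e. ~1000 → ~20 steps) lets the second-moment estimate track the noisier gradients seen at low batch sizes, which is the regime where the paper finds tuning $\beta_2$ essential.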
Problem

Research questions and friction points this paper is trying to address.

Language Model Optimization
Computational Budget
Inconsistency Resolution
Innovation

Methods, ideas, or system contributions that make the work stand out.

Language Model Optimization
Computational Efficiency
AdamW Parameter Adjustment
Tomer Porian
Tel Aviv University
Mitchell Wortsman
University of Washington
J. Jitsev
Jülich Supercomputing Centre (JSC) and LAION
Ludwig Schmidt
Stanford University and Anthropic
Machine Learning · Artificial Intelligence · Optimization · Algorithms · Statistics
Y. Carmon
Tel Aviv University