🤖 AI Summary
This work investigates the scaling behavior of the critical batch size (CBS) in the pre-training of large language models and its implications for parallel training optimization. We systematically train autoregressive language models ranging from 85M to 1.2B parameters on the C4 dataset, decoupling the effects of model size and dataset size. Contrary to prevailing assumptions, we empirically establish, for the first time, that CBS scales primarily with dataset size rather than model size. Leveraging infinite-width neural network theory and infinite-dimensional least-squares analysis, we provide a rigorous theoretical justification for this phenomenon. We further propose a reproducible empirical framework for CBS measurement, revealing strong sensitivity of CBS to hyperparameter choices, particularly learning rate and momentum, beyond fixed wall-clock training time. Our findings deliver both theoretical foundations and practical guidelines for resource allocation, distributed training strategy design, and hyperparameter tuning in large-scale pre-training.
📝 Abstract
Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), which captures the trade-off between training time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of autoregressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyperparameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its schedule, we systematically investigate the impact of scale on CBS. We then fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least-squares regression. Of independent interest, we highlight the importance of common hyperparameter choices and strategies for studying large-scale pre-training beyond fixed training durations.
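To make the scaling-law fitting step concrete, here is a minimal sketch of how a power law CBS ≈ a·D^b could be fit to measured critical batch sizes by least squares in log-log space. All numbers below are hypothetical placeholders, not values from the paper; a real analysis would use CBS measurements obtained from the hyperparameter sweeps described above.

```python
import numpy as np

# Hypothetical (dataset tokens, measured CBS) pairs for illustration only.
# These are constructed to scale linearly, mirroring the paper's finding
# that CBS grows with data size; they are not the paper's measurements.
data_tokens = np.array([1e9, 4e9, 1.6e10, 6.4e10])
cbs = np.array([256, 1024, 4096, 16384])

# Fit CBS ≈ a * D^b via linear regression in log-log space:
# log(CBS) = b * log(D) + log(a).
b, log_a = np.polyfit(np.log(data_tokens), np.log(cbs), 1)
a = np.exp(log_a)

def predicted_cbs(tokens):
    """Extrapolate the fitted power law to a new dataset size."""
    return a * tokens ** b

print(f"fitted exponent b = {b:.3f}")  # b near 1 indicates linear scaling in D
print(f"predicted CBS at 2.56e11 tokens: {predicted_cbs(2.56e11):.0f}")
```

An exponent b close to 1 in such a fit would support linear scaling of CBS with data size, while b close to 0 would indicate independence; fitting separate laws in model size and data size is one way to decouple the two effects.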