How to Set the Batch Size for Large-Scale Pre-training?

📅 2026-01-08
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the inadequacy of traditional critical batch size theory, which fails to guide batch size selection in large-scale pre-training under Warmup-Stable-Decay (WSD) learning rate schedules. By re-modeling the relationship between data consumption and training steps under WSD dynamics, the study proposes the first theoretical framework tailored to WSD scheduling. It introduces a minimum batch size threshold \(B_{\min}\) and an optimal batch size \(B_{\text{opt}}\), and establishes two key properties governing their behavior. Building on this foundation, the authors design a dynamic batch size scheduling strategy that substantially improves both training efficiency and model quality in large-scale experiments.

📝 Abstract
The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with the new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored to the WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and the resulting scheduling strategy significantly enhances both training efficiency and final model quality.
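The abstract pairs a WSD learning rate schedule with a dynamic batch size scheduler bounded by B_min and B_opt. The paper's actual scheduling rule is not given here; the sketch below is only a minimal illustration of the general idea, assuming a piecewise-linear WSD schedule and a linear batch-size ramp from B_min toward B_opt (all function names, phase fractions, and default values are hypothetical, not from the paper).

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.05, decay_frac=0.1):
    """Piecewise Warmup-Stable-Decay schedule: linear warmup to peak_lr,
    a long stable plateau, then linear decay to zero."""
    warmup_end = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_end:
        return peak_lr * step / max(warmup_end, 1)
    if step < decay_start:
        return peak_lr
    return peak_lr * (total_steps - step) / max(total_steps - decay_start, 1)

def batch_size_schedule(step, total_steps, b_min=256, b_opt=4096):
    """Hypothetical dynamic batch-size rule: start near the minimum
    threshold b_min and grow linearly toward the data-efficient b_opt,
    rounded to a multiple of b_min for hardware-friendly sizes."""
    frac = step / max(total_steps, 1)
    b = b_min + (b_opt - b_min) * frac
    return int(round(b / b_min)) * b_min
```

In practice the schedule would be queried once per optimizer step, e.g. `batch_size_schedule(step, total_steps)` alongside `wsd_lr(step, total_steps)`; the paper's contribution is the theory (the revised E(S) relationship) that determines how such a ramp should actually be shaped.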
Problem

Research questions and friction points this paper is trying to address.

Batch Size
Large-Scale Pre-training
WSD Scheduler
Critical Batch Size
Training Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Critical Batch Size
Warmup-Stable-Decay scheduler
data efficiency
dynamic batch size scheduling
large-scale pre-training
Yunhua Zhou
Fudan University
Machine Learning · Natural Language Processing
Junhao Huang
Victoria University of Wellington
Neural Architecture Search · Deep Neural Networks · Evolutionary Computation
Shuhao Xin
Shanghai AI Laboratory, Fudan University
Yechen Zhang
Shanghai AI Laboratory, Shanghai Jiao Tong University
Runyu Peng
Shanghai AI Laboratory
Qiping Guo
Shanghai AI Laboratory
Xipeng Qiu
Fudan University