Spike No More: Stabilizing the Pre-training of Large Language Models

๐Ÿ“… 2023-12-28
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 15
โœจ Influential: 5
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Sudden loss spikes during large language model (LLM) pretraining severely impair training stability and efficiency. Method: Grounded in the spectral norm of the Jacobian matrix, this work theoretically identifies uncontrolled sublayer gradient norms as the root cause of spikes and proposes a dual-condition stability criterion: โ€œsmall sublayersโ€ coupled with โ€œlarge residual connections.โ€ We develop a spectral-norm-based gradient dynamics model and introduce sublayer scaling and residual weight reparameterization strategies. Contribution/Results: Rigorously validated across diverse architectures (LLaMA, GPT) and scales (1Bโ€“7B), our approach eliminates loss spikes entirely, achieves 100% pretraining success rate, improves final model performance by 1.2โ€“2.8 percentage points on average, and significantly reduces computational waste from training collapses. This work transcends heuristic hyperparameter tuning, establishing an interpretable, verifiable theoretical foundation and practical framework for stable LLM training.
๐Ÿ“ Abstract
Loss spikes often occur during pre-training of large language models. The spikes degrade the performance of large language models and sometimes ruin the pre-training. Since the pre-training needs a vast computational budget, we should avoid such spikes. Based on the assumption that the loss spike is caused by the sudden growth of the gradient norm, we explore factors to keep the gradient norm small through an analysis of the spectral norms of the Jacobian matrices for the sub-layers. Our findings suggest that stabilizing the pre-training process requires two conditions: small sub-layers and large shortcut. We conduct various experiments to empirically verify our theoretical analyses. Experimental results demonstrate that methods satisfying the conditions effectively prevent loss spikes during pre-training.
Problem

Research questions and friction points this paper is trying to address.

Preventing loss spikes in large language model pre-training
Analyzing gradient norm growth causes in model training
Validating conditions for stable pre-training process
Innovation

Methods, ideas, or system contributions that make the work stand out.

Control gradient norm via Jacobian spectral norms
Ensure small sub-layers for stability
Use large shortcuts to prevent loss spikes
๐Ÿ”Ž Similar Papers
No similar papers found.