The Sharpness Disparity Principle in Transformers for Accelerating Language Model Pre-Training

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies a clear and persistent disparity in sharpness across Transformer blocks—embedding, normalization, self-attention, and feed-forward networks—that emerges early in pre-training and persists throughout. Motivated by this observation, the authors propose Blockwise Learning Rate, a module-level scheme that estimates each block's sharpness as a measure of local curvature and scales its learning rate accordingly. Integrated into AdamW and evaluated on GPT-2 and LLaMA models (0.12B–1.1B parameters), the method consistently reaches lower terminal loss and yields a nearly 2× training speedup over vanilla AdamW. Combined with the memory-efficient Adam-mini optimizer, it delivers a combined 2× speedup and 2× GPU memory saving while preserving training stability. The approach links block-wise optimization dynamics with curvature-aware learning rates, enabling more efficient and scalable LLM pre-training.
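Below is a minimal PyTorch sketch of what a blockwise learning-rate scheme of this kind could look like when expressed as AdamW parameter groups. The block-name patterns ("wte", "ln", "attn", "mlp"), the squared-gradient sharpness proxy, and the inverse-sharpness scaling rule are illustrative assumptions for a GPT-2-style model, not the paper's exact estimator or schedule.

```python
import torch

# Assumed block taxonomy for a GPT-2-style model; the name patterns below
# are illustrative, not the paper's exact grouping.
BLOCK_KEYS = {
    "embed": ("wte", "wpe"),
    "norm": ("ln",),
    "attn": ("attn",),
    "mlp": ("mlp",),
}

def group_params_by_block(model):
    """Split parameters into the four block families discussed in the paper."""
    groups = {name: [] for name in BLOCK_KEYS}
    for pname, param in model.named_parameters():
        for block, keys in BLOCK_KEYS.items():
            if any(k in pname for k in keys):
                groups[block].append(param)
                break
    return groups

def estimate_block_sharpness(groups, eps=1e-12):
    """Crude per-block curvature proxy: mean squared gradient entry,
    computed after loss.backward(). The paper estimates sharpness directly;
    this stand-in is only illustrative."""
    sharpness = {}
    for block, params in groups.items():
        grads = [p.grad for p in params if p.grad is not None]
        total = sum(g.pow(2).sum() for g in grads)
        numel = sum(g.numel() for g in grads)
        sharpness[block] = float(total) / max(numel, 1) + eps
    return sharpness

def make_blockwise_adamw(model, sharpness, base_lr=3e-4, weight_decay=0.1):
    """AdamW with one parameter group per block; each group's LR is scaled
    inversely to the block's relative sharpness (one plausible reading of
    'scaled accordingly'; the paper's exact rule may differ)."""
    groups = group_params_by_block(model)
    ref = max(sharpness.values())
    param_groups = [
        {"params": params, "lr": base_lr * ref / sharpness[block]}
        for block, params in groups.items() if params
    ]
    return torch.optim.AdamW(param_groups, weight_decay=weight_decay)
```

In practice the per-block estimates would be refreshed only occasionally (or even fixed after the early phase, since the disparity reportedly persists), with the group learning rates updated in place through `optimizer.param_groups`.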

📝 Abstract
Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B and datasets of OpenWebText and MiniPile. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
Problem

Research questions and friction points this paper is trying to address.

How does sharpness differ across Transformer blocks (embedding, normalization, attention, FFN) during pre-training?
Can learning rates tailored to each block's sharpness accelerate LLM pre-training?
How to achieve both training speedup and memory savings without sacrificing stability?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Blockwise Learning Rate
Sharpness Disparity Principle
Adam-mini integration
👥 Authors
Jinbo Wang
Texas A&M University
Ocean dynamics
Mingze Wang
School of Mathematical Sciences, Peking University
Machine Learning Theory · Deep Learning Theory · Optimization
Zhanpeng Zhou
Shanghai Jiao Tong University
Deep Learning Theory
Junchi Yan
FIAPR & ICML Board Member, SJTU (2018-), SII (2024-), AWS (2019-2022), IBM (2011-2018)
Computational Intelligence · AI4Science · Machine Learning · Autonomous Driving
E Weinan
School of Mathematical Sciences, Peking University; Center for Machine Learning Research, Peking University; AI for Science Institute, Beijing
Lei Wu
School of Mathematical Sciences, Peking University; Center for Machine Learning Research, Peking University