🤖 AI Summary
This work identifies significant and persistent structural differences in sharpness across Transformer modules—embedding, normalization, self-attention, and feed-forward networks—that emerge early in training and shape LLM pretraining efficiency. To address this, we propose a module-level sharpness-adaptive learning rate scheduling scheme: sharpness is dynamically estimated per module to quantify its local curvature, and learning rates are scaled accordingly. Integrated into AdamW and evaluated on GPT-2 and LLaMA models (0.12B–1.1B parameters), our method achieves lower final loss and accelerates training by nearly 2×. When instead combined with the memory-efficient Adam-mini optimizer, it delivers a combined 2× training speedup together with 2× lower GPU memory consumption, while preserving training stability. The approach bridges module-wise optimization dynamics with curvature-aware adaptive learning rates, enabling more efficient and scalable LLM pretraining.
📝 Abstract
Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B, on the OpenWebText and MiniPile datasets. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
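The core idea of Blockwise LR can be illustrated with a minimal sketch: estimate a sharpness value for each block type, then scale a shared base learning rate per block. The block names and the inverse-sharpness scaling rule below are illustrative assumptions for exposition, not the paper's exact estimator or formula.

```python
# Hypothetical sketch of Blockwise LR (assumed inverse-sharpness rule,
# not the paper's exact scheme): each block's LR is the base LR scaled
# down in proportion to how much sharper that block is than the
# flattest block.

def blockwise_lrs(base_lr, sharpness_by_block):
    """Assign each block lr = base_lr * (min sharpness / block sharpness)."""
    s_min = min(sharpness_by_block.values())
    return {name: base_lr * (s_min / s)
            for name, s in sharpness_by_block.items()}

# Illustrative (made-up) sharpness estimates: sharper blocks such as
# embeddings receive proportionally smaller steps.
sharpness = {"embedding": 8.0, "norm": 4.0, "attention": 2.0, "ffn": 1.0}
lrs = blockwise_lrs(3e-4, sharpness)
```

In an optimizer such as AdamW, these per-block LRs would map naturally onto parameter groups, with the sharpness estimates refreshed periodically during training.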