🤖 AI Summary
This work identifies significant and persistent structural differences in sharpness across Transformer modules—embedding, normalization, self-attention, and feed-forward networks—that emerge early in training and shape LLM pretraining efficiency. To address this, we propose a module-level sharpness-adaptive learning rate scheduling scheme: sharpness is dynamically estimated per module to quantify its local curvature, and learning rates are scaled accordingly. Integrated into AdamW and evaluated on GPT-2 and LLaMA models (0.12B–1.1B parameters), our method achieves lower final loss and accelerates training by nearly 2×. When instead combined with the memory-efficient Adam-mini optimizer, it delivers a combined 2× training speedup together with 2× lower GPU memory consumption, while preserving training stability. The approach bridges module-wise optimization dynamics with curvature-aware adaptive learning rates, enabling more efficient and scalable LLM pretraining.
📝 Abstract
Transformers consist of diverse building blocks, such as embedding layers, normalization layers, self-attention mechanisms, and point-wise feedforward networks. Thus, understanding the differences and interactions among these blocks is important. In this paper, we uncover a clear Sharpness Disparity across these blocks, which emerges early in training and intriguingly persists throughout the training process. Motivated by this finding, we propose Blockwise Learning Rate (LR), a strategy that tailors the LR to each block's sharpness, accelerating large language model (LLM) pre-training. By integrating Blockwise LR into AdamW, we consistently achieve lower terminal loss and nearly $2\times$ speedup compared to vanilla AdamW. We demonstrate this acceleration across GPT-2 and LLaMA, with model sizes ranging from 0.12B to 1.1B, on the OpenWebText and MiniPile datasets. Finally, we incorporate Blockwise LR into Adam-mini (Zhang et al., 2024), a recently proposed memory-efficient variant of Adam, achieving a combined $2\times$ speedup and $2\times$ memory saving. These results underscore the potential of exploiting the sharpness disparity to improve LLM training.
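The core idea of Blockwise LR can be illustrated with a minimal sketch: estimate a sharpness value for each block type, then scale a shared base learning rate per block. The block names and the inverse-sharpness scaling rule below are illustrative assumptions for exposition, not the paper's exact estimator or formula.

```python
# Hypothetical sketch of Blockwise LR (assumed inverse-sharpness rule,
# not the paper's exact scheme): each block's LR is the base LR scaled
# down in proportion to how much sharper that block is than the
# flattest block.

def blockwise_lrs(base_lr, sharpness_by_block):
    """Assign each block lr = base_lr * (min sharpness / block sharpness)."""
    s_min = min(sharpness_by_block.values())
    return {name: base_lr * (s_min / s)
            for name, s in sharpness_by_block.items()}

# Illustrative (made-up) sharpness estimates: sharper blocks such as
# embeddings receive proportionally smaller steps.
sharpness = {"embedding": 8.0, "norm": 4.0, "attention": 2.0, "ffn": 1.0}
lrs = blockwise_lrs(3e-4, sharpness)
```

In an optimizer such as AdamW, these per-block LRs would map naturally onto parameter groups, with the sharpness estimates refreshed periodically during training.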