Understanding Why Adam Outperforms SGD: Gradient Heterogeneity in Transformers

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates why Adam consistently outperforms SGD in Transformer training, focusing on gradient heterogeneity—the pronounced disparity in gradient magnitudes across parameters—and its impact on optimization dynamics. We formally define and quantify this phenomenon for the first time, demonstrating that severe gradient heterogeneity critically impedes SGD convergence while leaving sign-based optimization (e.g., signSGD) robust. We identify LayerNorm placement as a key architectural lever controlling gradient heterogeneity and uncover a novel role of momentum in sign-based optimizers: suppressing anomalous growth in linear head parameters. Through theoretical analysis, empirical gradient statistics, ablation studies with sign-based variants, and cross-task evaluations across multiple NLP and vision models under fine-tuning regimes, we validate our findings. Our results advance the fundamental understanding of adaptive optimization and provide both theoretical foundations and practical guidelines for designing lightweight, interpretable optimizers.
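The summary's central quantity, gradient heterogeneity, is the disparity in gradient magnitudes across parameters. As a minimal numerical sketch (the metric, names, and toy values here are illustrative, not taken from the paper), one can compute the L2 norm of each parameter group's gradient and compare the largest to the smallest:

```python
import math

def grad_norms(grads):
    """L2 norm of each parameter group's gradient.
    grads: dict mapping parameter name -> flat list of gradient entries."""
    return {name: math.sqrt(sum(g * g for g in gs)) for name, gs in grads.items()}

def heterogeneity_ratio(grads):
    """Max-to-min ratio of per-parameter gradient norms; a large value
    indicates severe gradient heterogeneity across parameters."""
    norms = grad_norms(grads).values()
    return max(norms) / min(norms)

# Toy gradients for three parameter groups at very different scales,
# mimicking the disparity the paper reports in Transformers.
grads = {
    "embedding": [0.001, -0.002, 0.001],
    "attention": [0.5, -0.3, 0.4],
    "linear_head": [5.0, -4.0, 3.0],
}
print(heterogeneity_ratio(grads))  # far greater than 1: heterogeneous gradients
```

A ratio far above 1 is the regime in which, per the summary, SGD convergence degrades while sign-based updates remain robust.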

📝 Abstract
Transformer models are challenging to optimize with SGD and typically require adaptive optimizers such as Adam. However, the reasons behind the superior performance of Adam over SGD remain unclear. In this study, we investigate the optimization of transformer models by focusing on *gradient heterogeneity*, defined as the disparity in gradient norms among parameters. Our analysis shows that gradient heterogeneity hinders gradient-based optimization, including SGD, while sign-based optimization, a simplified variant of Adam, is less affected. We further examine gradient heterogeneity in transformer models and show that it is influenced by the placement of layer normalization. Additionally, we show that the momentum term in sign-based optimization is important for preventing the excessive growth of linear-head parameters in tasks with many classes. Experimental results from fine-tuning transformer models in both NLP and vision domains validate our theoretical analyses. This study provides insights into the optimization challenges of transformer models and offers guidance for designing future optimization algorithms. Code is available at https://github.com/tom4649/gradient-heterogeneity.
Problem

Research questions and friction points this paper is trying to address.

Transformer models
Adam optimizer
Gradient heterogeneity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Akiyoshi Tomihari
Department of Computer Science, The University of Tokyo
Issei Sato
Department of Computer Science, The University of Tokyo
Machine learning