Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity

📅 2024-10-10
📈 Citations: 4
Influential: 1
📄 PDF
🤖 AI Summary
This paper investigates the theoretical underpinnings of Adam's superior empirical performance over SGD in language model training. Conventional analyses rely on ℓ₂-smoothness, an assumption that fails to capture the observed heterogeneity and sparsity of gradients across coordinates. Method: the authors identify and empirically validate an ℓ∞-geometric structure in the loss landscape, characterized by heterogeneous and sparse coordinate-wise gradient variation, and develop a convergence analysis framework for adaptive optimizers based on ℓ∞-smoothness. Contribution/Results: they prove that Adam's coordinate-wise adaptive step sizes exploit this structure, yielding convergence guarantees with a much better empirical smoothness constant than standard ℓ₂-based analyses, whereas SGD cannot exploit it and remains robust but suboptimal. They further extend the analysis to a blockwise Adam variant and verify the theory on GPT-2 and ResNet: artificially disrupting the ℓ∞-geometry sharply degrades Adam's performance while SGD remains stable, confirming this geometry as the key mechanism behind Adam's advantage.
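For orientation, here is a minimal NumPy sketch of the standard (textbook) Adam update; the per-coordinate denominator sqrt(v_hat) + eps is the coordinate-wise adaptivity the summary refers to. This is a generic illustration, not the paper's exact formulation or analysis setting.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam step (generic sketch, not the paper's notation).

    Each coordinate gets its own effective step size lr / (sqrt(v_hat) + eps),
    so the update adapts to heterogeneous per-coordinate gradient scales.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # coordinate-wise second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```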

📝 Abstract
Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically -- previous convergence analysis for Adam and SGD mainly focuses on the number of steps $T$ and is already minimax-optimal in non-convex cases, which are both $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel blockwise smoothness assumptions.
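As a rough paraphrase of the assumption swap (not the paper's exact statement), smoothness with respect to a norm is usually expressed by bounding the gradient difference in the dual norm:

```latex
% Rough paraphrase only; the paper's precise assumption may be stated differently.
\|\nabla f(x) - \nabla f(y)\|_2 \;\le\; L_2 \,\|x - y\|_2
    \qquad (\ell_2\text{-smoothness, dual norm } \ell_2)
\|\nabla f(x) - \nabla f(y)\|_1 \;\le\; L_\infty \,\|x - y\|_\infty
    \qquad (\ell_\infty\text{-smoothness, dual norm } \ell_1)
```

Per the abstract, the ℓ∞-based assumption yields a much better empirical smoothness constant for GPT-2 and ResNet than the usual ℓ₂-based one, which is what the new analysis lets Adam exploit.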
Problem

Research questions and friction points this paper is trying to address.

Understanding Adam's advantage over SGD through its exploitation of ℓ∞-geometry
Analyzing Adam's convergence under novel ℓ∞-smoothness assumptions
Testing Adam's performance when the favorable ℓ∞-geometry is altered
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adam exploits the ℓ∞-geometry of the loss landscape
Convergence analysis under ℓ∞-smoothness assumptions
Blockwise Adam analyzed under blockwise smoothness assumptions (see the sketch below)
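A hypothetical sketch of what a blockwise Adam-style step could look like, assuming each parameter block (e.g. a layer's weight matrix) shares a single second-moment scalar; the paper's actual blockwise variant and its smoothness assumptions may differ.

```python
import numpy as np

def blockwise_adam_step(params, grads, state, t, lr=1e-3,
                        beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam-style update with one shared second-moment scalar per block.

    Hypothetical illustration: instead of a per-coordinate v, each named
    block averages its squared gradients into a single scalar, so all
    coordinates in the block share one effective step size.
    """
    new_params = {}
    for name, theta in params.items():
        g = grads[name]
        m, v = state[name]                             # m: array, v: scalar per block
        m = beta1 * m + (1 - beta1) * g                # block's first-moment estimate
        v = beta2 * v + (1 - beta2) * np.mean(g ** 2)  # block-averaged second moment
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        new_params[name] = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        state[name] = (m, v)                           # state updated in place
    return new_params
```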
Shuo Xie
Toyota Technological Institute at Chicago
machine learning · optimization
Mohamad Amin Mohamadi
Toyota Technological Institute at Chicago
Zhiyuan Li
Toyota Technological Institute at Chicago