The Optimization Landscape of SGD Across the Feature Learning Strength

📅 2024-10-06
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work investigates how the final-layer scaling hyperparameter γ controls feature-learning strength in neural networks and how it couples with the learning rate η, with particular attention to the ultra-rich regime (γ ≫ 1). Using online SGD training, empirical loss-landscape analysis, and systematic two-dimensional sweeps over γ and η, complemented by a simple theoretical model, we identify a piecewise power-law scaling of the optimal learning rate: η* ∝ γ² for small γ and η* ∝ γ^{2/L} for large γ, where L denotes network depth. In the ultra-rich regime we observe a characteristic optimization pattern of a long plateau, a sharp drop, and staircase-like convergence, and show that loss trajectories for different large γ collapse onto one another under a reparameterization of time. Empirical results show that proper tuning of γ substantially improves online performance, whereas leaving this hyperparameter untuned can lead to a severe underestimation of model capability.

📝 Abstract
We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
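To make the γ-parameterization in the abstract concrete, the following minimal NumPy sketch builds a ReLU MLP whose readout layer is down-scaled by γ. It assumes "down-scaled" means the readout is divided by γ (so larger γ shrinks the output scale and forces larger weight movement, i.e. richer dynamics); the exact normalization and initialization used in the paper may differ.

```python
import numpy as np

def init_params(widths, rng):
    """He-style initialization for a ReLU MLP; gamma is applied only in
    the forward pass, not to the weights themselves."""
    return [rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, x, gamma):
    """Forward pass with the readout down-scaled by gamma (assumed convention).
    Small gamma -> large output scale, closer to lazy/kernel dynamics;
    large gamma -> small output scale, closer to rich feature learning."""
    h = x
    for W in params[:-1]:
        h = np.maximum(h @ W, 0.0)    # hidden ReLU layers
    return (h @ params[-1]) / gamma   # down-scaled readout

rng = np.random.default_rng(0)
params = init_params([10, 256, 256, 1], rng)
x = rng.standard_normal((32, 10))
print(forward(params, x, gamma=0.01).std())   # toward the lazy end
print(forward(params, x, gamma=100.0).std())  # toward the ultra-rich end
```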
Problem

Research questions and friction points this paper is trying to address.

Investigates the impact of feature learning strength (γ) on neural network training dynamics.
Explores the relationship between learning rate (η) and feature learning strength (γ).
Examines the optimization behavior in the under-explored ultra-rich regime (γ ≫ 1).
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scaling the final layer by the hyperparameter γ controls feature-learning strength.
The optimal learning rate η* scales non-trivially with γ (see the sketch after this list).
The ultra-rich γ ≫ 1 regime shows characteristic loss-curve patterns: a long plateau, a sharp drop, and staircase steps.
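As a rough, illustrative reading of the scaling result above, the hypothetical helper below interpolates between the two reported regimes by taking the smaller of the two power laws; this is a heuristic for intuition, not the paper's prescription.

```python
def suggested_lr(gamma, depth, eta_base=0.1):
    """Illustrative only: eta* ∝ gamma**2 for gamma << 1 and
    eta* ∝ gamma**(2/depth) for gamma >> 1. Taking the minimum of the
    two power laws reproduces both limits and matches them at gamma = 1."""
    return eta_base * min(gamma ** 2, gamma ** (2.0 / depth))

print(suggested_lr(0.01, depth=3))   # small-gamma regime: scales as gamma**2
print(suggested_lr(100.0, depth=3))  # large-gamma regime: scales as gamma**(2/L)
```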
Alexander B. Atanasov
Department of Physics, Harvard University; Center for Brain Science, Harvard University; School of Engineering and Applied Science, Harvard University
Alexandru Meterez
Harvard University
Machine Learning; Deep Learning Theory; Optimization
James B. Simon
UC Berkeley
Deep Learning; Theoretical Physics; Statistical Mechanics
C. Pehlevan
School of Engineering and Applied Science, Harvard University; Center for Brain Science, Harvard University; Kempner Institute, Harvard University