🤖 AI Summary
This work investigates the regulatory mechanism of the final-layer scaling hyperparameter γ on feature learning strength in neural networks, and its coupling with the learning rate η—particularly in the “super-rich” regime (γ ≫ 1). Using online SGD training, empirical loss landscape analysis, and systematic γ–η two-dimensional scaling experiments—complemented by theoretical modeling—we discover, for the first time, a piecewise power-law scaling of the optimal learning rate: η* ∝ γ² for small γ and η* ∝ γ^{2/L} for large γ, where L denotes network depth. We identify a novel optimization paradigm in the super-rich regime: “long plateau → sharp drop → staircase-like convergence”, and demonstrate that loss trajectories become universal under temporal rescaling. Empirical results show that proper tuning of γ substantially improves online performance, whereas neglecting this hyperparameter leads to severe underestimation of model capability.
📝 Abstract
We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter $\gamma$. Recent work has identified $\gamma$ as controlling the strength of feature learning. As $\gamma$ increases, network evolution changes from "lazy" kernel dynamics to "rich" feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling $\gamma$ across a variety of models and datasets in the online training setting. We first examine the interaction of $\gamma$ with the learning rate $\eta$, identifying several scaling regimes in the $\gamma$-$\eta$ plane which we explain theoretically using a simple model. We find that the optimal learning rate $\eta^*$ scales non-trivially with $\gamma$. In particular, $\eta^* \propto \gamma^2$ when $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ when $\gamma \gg 1$ for a feed-forward network of depth $L$. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored "ultra-rich" $\gamma \gg 1$ regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find that networks with different large $\gamma$ values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large $\gamma$ and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-$\gamma$ limit may yield useful insights into the dynamics of representation learning in performant models.
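To make the two scaling laws concrete, below is a minimal sketch (not the authors' code) of a depth-$L$ feed-forward network whose output is down-scaled by $\gamma$, together with a helper applying the piecewise rule $\eta^* \propto \gamma^2$ for $\gamma \ll 1$ and $\eta^* \propto \gamma^{2/L}$ for $\gamma \gg 1$. The base learning rate `eta_base`, the crossover at $\gamma = 1$, and the exact form of the down-scaling (dividing the output by $\gamma$) are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch: gamma-down-scaled MLP and piecewise learning-rate rule.
import torch
import torch.nn as nn

class GammaScaledMLP(nn.Module):
    def __init__(self, d_in, d_hidden, depth, gamma):
        super().__init__()
        widths = [d_in] + [d_hidden] * (depth - 1)
        layers = []
        for w_in, w_out in zip(widths[:-1], widths[1:]):
            layers += [nn.Linear(w_in, w_out), nn.ReLU()]
        layers.append(nn.Linear(widths[-1], 1))  # depth L linear layers in total
        self.net = nn.Sequential(*layers)
        self.gamma = gamma

    def forward(self, x):
        # Down-scale the final-layer output by gamma; larger gamma pushes the
        # dynamics away from the lazy (kernel) regime toward feature learning.
        return self.net(x) / self.gamma

def optimal_lr(gamma, depth, eta_base=0.1):
    """Piecewise power law from the abstract: eta* ~ gamma^2 for gamma << 1,
    eta* ~ gamma^(2/L) for gamma >> 1 (crossover at gamma = 1 is assumed)."""
    return eta_base * (gamma ** 2 if gamma < 1.0 else gamma ** (2.0 / depth))

# Example: sweep gamma while rescaling the SGD learning rate accordingly.
depth = 4
for gamma in [0.1, 1.0, 10.0, 100.0]:
    model = GammaScaledMLP(d_in=32, d_hidden=256, depth=depth, gamma=gamma)
    opt = torch.optim.SGD(model.parameters(), lr=optimal_lr(gamma, depth))
```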