🤖 AI Summary
This work investigates the benign landscape—i.e., absence of spurious local minima—of the training loss for two-layer ReLU networks with weight decay. **Problem:** We characterize when such landscapes are globally benign under varying degrees of over-parameterization and initialization scale. **Method:** Leveraging piecewise-linear analysis, activation pattern partitioning, orthogonal data construction, and geometric characterization of the loss surface, we derive exact conditions for landscape benignity. **Contribution/Results:** We establish that a benign landscape occurs *if and only if* both large over-parameterization (width $m \gtrsim \min(n^d, 2^n)$, for $n$ data points in dimension $d$) and large initialization hold simultaneously; in contrast, small initialization provably induces spurious local minima, revealing the fundamental role of initialization scale. Our analysis yields the first necessary and sufficient conditions—and precise parameter thresholds—for benign landscapes under weight decay. These results clarify the synergistic interplay between over-parameterization and initialization, and provide a rigorous theoretical foundation for understanding optimization dynamics under explicit $\ell_2$ regularization.
📝 Abstract
The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes benign -- i.e., free of spurious local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely, in this regime, almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary, via the example of orthogonal data. Finally, we demonstrate that such loss landscape results primarily hold relevance in the large initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the global benignity of the landscape.
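To make the object of study concrete, here is a minimal sketch of the $\ell_2$-regularized training loss for a two-layer ReLU network, written in its standard form (squared loss plus weight decay on all parameters). The variable names (`W`, `v`, `lam`) and the exact loss normalization are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def regularized_loss(W, v, X, y, lam):
    """l2-regularized squared loss of a two-layer ReLU network.

    W   : (m, d) hidden-layer weights (width m, input dimension d)
    v   : (m,)   output-layer weights
    X   : (n, d) data matrix (n data points)
    y   : (n,)   targets
    lam : weight-decay strength

    Note: the paper's exact loss may use a different normalization;
    this is the generic form of squared loss + weight decay.
    """
    hidden = np.maximum(X @ W.T, 0.0)          # ReLU activations, shape (n, m)
    preds = hidden @ v                          # network outputs, shape (n,)
    fit = np.mean((preds - y) ** 2)             # data-fitting term
    reg = lam * (np.sum(W ** 2) + np.sum(v ** 2))  # weight decay on both layers
    return fit + reg

# Small example: random data and weights.
rng = np.random.default_rng(0)
n, d, m = 8, 3, 16
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
W = rng.standard_normal((m, d))
v = rng.standard_normal(m)
loss = regularized_loss(W, v, X, y, lam=0.1)
```

The landscape results in the abstract concern the stationary points of this function jointly in `(W, v)`; the "constant activation regions" are the polyhedral sets on which the sign pattern of `X @ W.T` is fixed, making the loss smooth (in fact, quadratic in each layer separately) within each region.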