Benignity of loss landscape with weight decay requires both large overparametrization and initialization

📅 2025-05-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the benign landscape—i.e., absence of spurious local minima—of the training loss for two-layer ReLU networks with weight decay. **Problem:** We characterize when such landscapes are globally benign under varying degrees of over-parameterization and initialization scale. **Method:** Leveraging piecewise-linear analysis, activation pattern partitioning, orthogonal data construction, and geometric characterization of the loss surface, we derive exact conditions for landscape benignity. **Contribution/Results:** We establish that a benign landscape occurs *if and only if* both large over-parameterization (width $m \gtrsim \min(n^d, 2^n)$) and large initialization hold simultaneously; in contrast, small initialization provably induces spurious local minima, revealing the fundamental role of initialization scale. Our analysis yields the first necessary and sufficient conditions—and precise parameter thresholds—for benign landscapes under weight decay. These results clarify the synergistic interplay between over-parameterization and initialization, and provide a rigorous theoretical foundation for understanding optimization dynamics under weight decay regularization.

📝 Abstract
The optimization of neural networks under weight decay remains poorly understood from a theoretical standpoint. While weight decay is standard practice in modern training procedures, most theoretical analyses focus on unregularized settings. In this work, we investigate the loss landscape of the $\ell_2$-regularized training loss for two-layer ReLU networks. We show that the landscape becomes benign -- i.e., free of spurious local minima -- under large overparametrization, specifically when the network width $m$ satisfies $m \gtrsim \min(n^d, 2^n)$, where $n$ is the number of data points and $d$ the input dimension. More precisely, in this regime, almost all constant activation regions contain a global minimum and no spurious local minima. We further show that this level of overparametrization is not only sufficient but also necessary, via the example of orthogonal data. Finally, we demonstrate that such loss landscape results primarily hold relevance in the large initialization regime. In contrast, for small initializations -- corresponding to the feature learning regime -- optimization can still converge to spurious local minima, despite the global benignity of the landscape.
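As a quick illustration (not from the paper itself), the stated width threshold $m \gtrsim \min(n^d, 2^n)$ can be evaluated for sample problem sizes; the `width_threshold` helper below is hypothetical, and the example values of $n$ and $d$ are chosen only to show which term dominates:

```python
# Sketch: the over-parameterization scale min(n^d, 2^n) from the paper's
# benign-landscape condition. Helper name and sample sizes are illustrative.

def width_threshold(n: int, d: int) -> int:
    """Return min(n**d, 2**n) for n data points in input dimension d."""
    return min(n ** d, 2 ** n)

# Low input dimension: n^d is the smaller term.
print(width_threshold(100, 2))   # 100^2 = 10000, far below 2^100

# High input dimension: 2^n is the smaller term.
print(width_threshold(10, 50))   # 2^10 = 1024, far below 10^50
```

Note that for fixed $d$ the threshold grows only polynomially in $n$, while for small $n$ it is capped at $2^n$, the number of possible activation patterns on $n$ points.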
Problem

Research questions and friction points this paper is trying to address.

Understanding neural network optimization under weight decay
Analyzing loss landscape in two-layer ReLU networks
Exploring overparametrization and initialization effects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large overparametrization ensures benign loss landscape
Orthogonal data necessitates specific overparametrization level
Large initialization prevents spurious local minima
Etienne Boursier
INRIA Saclay
Machine Learning · Statistics · Game Theory
Matthew Bowditch
Mathematics Institute, University of Warwick, Coventry, UK
Matthias Englert
Department of Computer Science, University of Warwick, Coventry, UK
Ranko Lazic
Department of Computer Science, University of Warwick
Theoretical Computer Science