🤖 AI Summary
This work investigates the loss landscape of one-hidden-layer networks with ReLU-like activations under the squared loss, focusing on the structure of directional stationary points and their impact on training dynamics. Methodologically, it introduces a first-order "escape neuron" criterion that distinguishes non-minimizing stationary points, including saddle points, from local minima: a stationary point without escape neurons must be a local minimum, and, for scalar outputs, the existence of an escape neuron certifies that the point is not a local minimum. This criterion captures saddle-escape mechanisms that previous analyses overlooked, refining the characterization of saddle-to-saddle training trajectories under vanishingly small initialization. Furthermore, leveraging network embedding theory, the paper shows how embedding a narrower network in a wider one reshapes the distribution of stationary points. The analysis integrates directional stationarity, piecewise-smooth optimization, and the dynamical modeling of gradient descent.
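The network-embedding operation mentioned above can be made concrete with a small numerical sketch. The example below is illustrative only (the variable names and the specific split ratios are assumptions, not taken from the paper): it embeds a 3-neuron one-hidden-layer ReLU network into a 4-neuron one by duplicating a hidden neuron and splitting its outgoing weight, which leaves the computed function unchanged while changing where the parameters sit in the wider landscape.

```python
import numpy as np

def forward(x, W, a):
    # One-hidden-layer ReLU network with scalar output: f(x) = a^T relu(W x)
    return a @ np.maximum(W @ x, 0.0)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # hidden-layer weights: 3 neurons, 2 inputs
a = rng.standard_normal(3)        # output weights
x = rng.standard_normal(2)        # a sample input

# Embed into a 4-neuron network: duplicate neuron 0 and split its
# outgoing weight as a[0] = 0.3*a[0] + 0.7*a[0] (the split is arbitrary).
W_wide = np.vstack([W, W[0]])
a_wide = np.concatenate([a, [0.7 * a[0]]])
a_wide[0] = 0.3 * a[0]

# The wider network computes exactly the same function.
assert np.isclose(forward(x, W, a), forward(x, W_wide, a_wide))
print("embedding preserves the function")
```

Because an entire one-parameter family of splits (here parameterized by the ratio 0.3/0.7) yields the same function, a single stationary point of the narrow network generally maps to a whole set of stationary points of the wide one.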
📝 Abstract
In this paper, we study the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained on the empirical squared loss with gradient descent (GD). We identify the stationary points of such networks, which significantly slow down the decrease of the loss during training. To capture such points while accounting for the non-differentiability of the loss, the stationary points we study are directional stationary points, rather than weaker notions such as Clarke stationary points. We show that if a stationary point contains no "escape neurons", which are defined by first-order conditions, it must be a local minimum; moreover, in the scalar-output case, the presence of an escape neuron guarantees that the stationary point is not a local minimum. Our results refine the description of the saddle-to-saddle training process starting from infinitesimally small (vanishing) initialization for shallow ReLU-like networks: by ruling out the saddle-escape types that previous works could not exclude, we move one step closer to a complete picture of the entire dynamics. We also fully characterize how network embedding, i.e., instantiating a narrower network with a wider one, reshapes the stationary points.
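The distinction between directional and Clarke stationarity can be seen in a one-parameter toy example (this example is ours, not taken from the paper): the squared loss of a single ReLU neuron on one sample, loss(w) = (relu(w) - 1)^2, is non-differentiable at w = 0. There, 0 lies in the Clarke subdifferential [-2, 0], so w = 0 is Clarke stationary, yet the one-sided derivative in the direction +1 is -2, so it is not a directional stationary point and GD can still make progress.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Squared loss of one ReLU neuron on a single sample (x = 1, y = 1):
# loss(w) = (relu(w) - 1)^2, non-differentiable at w = 0.
def loss(w):
    return (relu(w) - 1.0) ** 2

def dir_deriv(f, w, d, t=1e-7):
    # One-sided directional derivative f'(w; d) via a forward difference.
    return (f(w + t * d) - f(w)) / t

# At w = 0: the derivative toward -1 is 0, but toward +1 it is -2,
# so w = 0 admits a descent direction and is not directionally stationary,
# even though it is Clarke stationary (0 is in conv{-2, 0} = [-2, 0]).
print(dir_deriv(loss, 0.0, -1.0))  # ≈ 0
print(dir_deriv(loss, 0.0, +1.0))  # ≈ -2
```

This is why the paper's notion of stationarity is the directional one: it excludes spurious "stationary" points like w = 0 above at which the loss still has a first-order descent direction.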