Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

📅 2026-05-18

🤖 AI Summary
This work studies nonconvex stochastic optimization under heavy-tailed noise with scale-invariant methods, which respect the model parametrization and exploit the geometry of input-output matrix norms. It establishes a dimension-dependent lower bound of Ω(min{m,n}·ε^{-(3p−2)/(p−1)}) oracle calls for scale-invariant first-order methods with the spectral norm under p-th moment heavy-tailed noise, valid when max{m,n}/(min{m,n})² is large enough, and shows that a batched Scion method matches this bound. A transported Scion method then exploits higher-order smoothness: when the Hessian is Lipschitz, the complexity improves to O(min{m,n}·ε^{-(5p−3)/(2p−2)}). Experiments with a practical variant of the transported method across multiple architectures and model sizes illustrate its flexibility and compatibility in training neural networks.
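As a quick sanity check on the two exponents (a worked substitution, not a result taken from the paper), the gap between them is constant in p, and the bounded-variance case p = 2 gives the familiar values 4 and 3.5:

```latex
\frac{3p-2}{p-1} - \frac{5p-3}{2p-2}
  = \frac{(6p-4) - (5p-3)}{2(p-1)}
  = \frac{p-1}{2(p-1)}
  = \frac{1}{2}
  \qquad (p > 1),
\\[4pt]
\left.\frac{3p-2}{p-1}\right|_{p=2} = 4,
\qquad
\left.\frac{5p-3}{2p-2}\right|_{p=2} = \frac{7}{2}.
```

Under the stated conditions, the transported method therefore improves the dependence on ε by a factor of ε^{-1/2} for every p > 1, while the min{m,n} factor is unchanged.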
📝 Abstract
A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods have become important because their normalized layerwise updates not only support hyperparameter transfer across model sizes but also exploit input-output matrix norm geometry. At the same time, stochastic gradient noise in deep learning is often far from sub-Gaussian and may exhibit heavy tails. These observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norms, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to find an $\epsilon$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with the spectral norm requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with the spectral norm achieves the matching upper bound of $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.
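To make the notion of a scale-invariant update with the spectral norm concrete, the following is a minimal NumPy sketch, not the paper's batched or transported Scion algorithm. It assumes the update direction is the linear minimization oracle over the unit spectral-norm ball, which for a gradient with thin SVD U diag(s) Vᵀ is −UVᵀ. The helper names `spectral_lmo` and `batched_spectral_step`, the batch size, the step size, and the Student-t noise model are all illustrative assumptions.

```python
import numpy as np


def spectral_lmo(grad):
    # Direction of unit spectral norm that minimizes <grad, D>:
    # -U V^T, taken from the thin SVD grad = U diag(s) V^T.
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -(u @ vt)


def batched_spectral_step(weight, grad_samples, lr):
    # Hypothetical helper: average a batch of stochastic gradients,
    # then move a spectral-norm distance lr along the LMO direction.
    mean_grad = np.mean(grad_samples, axis=0)
    return weight + lr * spectral_lmo(mean_grad)


rng = np.random.default_rng(0)
m, n, batch = 4, 6, 8
weight = rng.standard_normal((m, n))
true_grad = rng.standard_normal((m, n))

# Student-t noise with 1.8 degrees of freedom: infinite variance,
# finite p-th moment only for p < 1.8 (an illustrative heavy-tailed choice).
noise = rng.standard_t(df=1.8, size=(batch, m, n))
grad_samples = true_grad + noise

new_weight = batched_spectral_step(weight, grad_samples, lr=0.1)
step = new_weight - weight
```

Because the direction depends only on the singular vectors of the averaged gradient, rescaling the gradient by any positive constant leaves the update unchanged, and the step always has spectral norm equal to the step size regardless of how large a heavy-tailed sample happens to be.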
Problem

Research questions and friction points this paper is trying to address.

scale-invariant optimization
heavy-tailed noise
matrix norm geometry
dimension dependence
nonconvex stochastic optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

scale-invariant optimization
heavy-tailed noise
spectral norm
nonconvex stochastic optimization
higher-order smoothness