🤖 AI Summary
This work clarifies the relationship between the noise covariance of stochastic gradient descent (SGD) and the curvature of the loss landscape, correcting the common misconception that equates the Fisher information matrix with the Hessian. Leveraging the Activity–Weight Duality, the authors derive a general relation that does not rely on the negative log-likelihood assumption: the noise covariance approximately commutes with the expected sample-wise Hessian, and their diagonal elements obey a power law $C_{ii} \propto H_{ii}^\gamma$ with $1 \leq \gamma \leq 2$. Experiments across diverse datasets, network architectures, and loss functions support the universality of this relationship, yielding a unified characterization of the connection between optimization noise and loss curvature in deep learning.
📝 Abstract
Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity--Weight Duality, we derive a more general relationship that is agnostic to the specific loss formulation: $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian and $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly, and their diagonal elements follow an approximate power law $C_{ii} \propto H_{ii}^{\gamma}$ with a theoretically bounded exponent $1 \leq \gamma \leq 2$ determined by the per-sample Hessian spectra. Experiments across datasets, architectures, and loss functions validate these bounds, providing a unified characterization of the noise-curvature relationship in deep learning.
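The diagonal power-law relation above can be probed numerically. The sketch below is a minimal toy illustration, not the paper's deep-network setup: it uses linear regression with an MSE loss (where the per-sample Hessian $\mathbf{h}_p = x_p x_p^\top$ is exact), estimates the batch-size-1 SGD noise covariance from per-sample gradients, and fits the exponent $\gamma$ in $C_{ii} \propto H_{ii}^{\gamma}$ by log-log regression. All dimensions, scales, and noise levels are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: per-sample loss l_p(w) = 0.5 * (w @ x_p - y_p)^2, so the
# per-sample gradient is g_p = (w @ x_p - y_p) x_p and the per-sample
# Hessian is h_p = x_p x_p^T (exact for a linear model with MSE).
n, d = 5000, 8
scales = rng.uniform(0.5, 2.0, size=d)       # anisotropic input directions
X = rng.normal(size=(n, d)) * scales
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)    # labels with additive noise
w = w_true + 0.01 * rng.normal(size=d)       # evaluate near a minimum

residuals = X @ w - y
G = residuals[:, None] * X                   # per-sample gradients, shape (n, d)
C = np.cov(G, rowvar=False)                  # SGD noise covariance (batch size 1)
H_diag = np.mean(X**2, axis=0)               # diag(H) = diag(E_p[h_p])

# Fit C_ii ∝ H_ii^gamma by linear regression in log-log space.
gamma = np.polyfit(np.log(H_diag), np.log(np.diag(C)), 1)[0]
print(f"fitted gamma = {gamma:.2f}")
```

In this toy setting the residual variance is nearly uniform across samples, so the fitted $\gamma$ lands near the lower bound of 1; exponents approaching 2 correspond to regimes where the per-sample Hessians vary strongly across samples, as the abstract's bound describes.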