Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization

📅 2026-05-05
📈 Citations: 0
Influential: 0
📄 PDF

career value

193K/year
📝 Abstract
We provide a theoretical analysis of Adam under non-stationary stochastic objectives, separating two regimes: Euclidean tracking under adaptive strong monotonicity of the Adam-preconditioned mean-gradient operator, and high-probability projected stationarity guarantees under general $L$-smooth objectives. In the tracking regime, we derive finite-time expected and high-probability bounds that decompose sharply into four components: initialization, objective drift, a first-moment tracking error governed by $β_1$, and a preconditioner perturbation governed by $β_2$. We characterize the burn-in time to reach Adam's irreducible tracking floor under constant and step-decay schedules. We also prove a high-probability bound on the average projected stationarity gap for Adam under distribution shift. Across both analyses, our bounds reveal a noise--drift tradeoff: in noise-dominated regimes, first-moment averaging and adaptive preconditioning can improve the high-probability error, whereas in drift-dominated regimes, stale first-moment information and preconditioner perturbations can compound the cost of nonstationarity, allowing vanilla SGD to achieve a smaller tracking floor. Our explicit $(β_1,β_2,ε)$-dependent bounds delineate when adaptive step-sizing is beneficial versus harmful, and provide a theoretical mechanism for Adam's empirical instability and stabilization under distribution shift.
Problem

Research questions and friction points this paper is trying to address.

nonstationary optimization
Adam
SGD
adaptive methods
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

nonstationary optimization
Adam
SGD
adaptive preconditioning
tracking error