SGD with Clipping is Secretly Estimating the Median Gradient

📅 2024-02-20
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses robust optimization in distributed learning under Byzantine failures, data heterogeneity, differential-privacy constraints, and state-dependent heavy-tailed gradient noise. Method: The authors show that gradient clipping in SGD implicitly performs geometric median estimation, establishing for the first time its theoretical equivalence to explicit median-based gradient estimation. They propose an iterative geometric median estimation framework that unifies the analysis of gradient clipping, DP-SGD, and related methods; crucially, convergence is proven under heavy-tailed noise without assuming bounded or light-tailed gradients. Contribution/Results: The paper introduces a unified theoretical framework for median-based estimation applicable across multiple robust learning scenarios. It provides convergence guarantees that require neither gradient boundedness nor sub-Gaussian tail conditions, and it yields practical, implementable algorithms, bridging theoretical rigor with broad applicability in modern robust and private distributed optimization.
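The geometric median mentioned in the summary can be computed with the classical Weiszfeld fixed-point iteration. A minimal sketch of the idea (the function name, tolerance, and example data below are illustrative, not taken from the paper):

```python
import numpy as np

def geometric_median(points, iters=100, eps=1e-8):
    """Weiszfeld's fixed-point iteration: find the point minimizing
    the sum of Euclidean distances to all rows of `points`."""
    m = points.mean(axis=0)  # initialize at the arithmetic mean
    for _ in range(iters):
        d = np.linalg.norm(points - m, axis=1)
        d = np.maximum(d, eps)                      # avoid division by zero
        w = 1.0 / d                                 # inverse-distance weights
        m_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(m_new - m) < eps:
            break
        m = m_new
    return m

# A single large outlier barely moves the median, unlike the mean:
grads = np.array([[1.0, 0.0], [1.1, 0.1], [0.9, -0.1], [100.0, 100.0]])
print(geometric_median(grads))  # stays near the [1, 0] cluster
print(grads.mean(axis=0))       # dragged far toward the outlier
```

This robustness to a minority of arbitrarily bad points is what makes the geometric median attractive as a gradient aggregator under corrupted nodes or heavy-tailed noise.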

📝 Abstract
There are several applications of stochastic optimization where one can benefit from a robust estimate of the gradient. For example, domains such as distributed learning with corrupted nodes, the presence of large outliers in the training data, learning under privacy constraints, or even heavy-tailed noise due to the dynamics of the algorithm itself. Here we study SGD with robust gradient estimators based on estimating the median. We first consider computing the median gradient across samples, and show that the resulting method can converge even under heavy-tailed, state-dependent noise. We then derive iterative methods based on the stochastic proximal point method for computing the geometric median and generalizations thereof. Finally we propose an algorithm estimating the median gradient across iterations, and find that several well known methods - in particular different forms of clipping - are particular cases of this framework.
Problem

Research questions and friction points this paper is trying to address.

Robust gradient estimation for corrupted data and outliers
Developing stochastic proximal methods for median gradients
Convergence guarantees under heavy-tailed state-dependent noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic proximal point method for median gradient
Online median gradient estimation via clipping techniques
Convergence under heavy-tailed state-dependent noise
Fabian Schaipp
Inria Paris
Optimization, Machine Learning

Guillaume Garrigos
Université Paris Cité
Optimization, Inverse Problems, Signal Processing, Machine Learning

Umut Şimşekli
Inria, CNRS, DI-ENS, PSL Research University, Paris

Robert M. Gower
CCM, Flatiron Institute, Simons Foundation, New York, NY