Dual Space Preconditioning for Gradient Descent in the Overparameterized Regime

๐Ÿ“… 2026-03-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work investigates the convergence and implicit bias of preconditioned gradient descent in over-parameterized linear models, focusing on a class of preconditioners defined by gradients of convex functionsโ€”such as normalized gradient descent, gradient clipping, and Adam. The authors introduce a novel analytical framework based on Bregman divergences to establish global convergence of the iterates to an interpolating solution. They show that under isotropic preconditioning, the limit point coincides with that of standard gradient descent, namely the minimum Frobenius norm solution. Furthermore, in the general case, they derive norm-based bounds relating the preconditioned solution to the gradient descent solution, thereby characterizing the implicit regularization effect induced by such preconditioning schemes.
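The update rule studied by the paper can be written compactly in the abstract's notation (a sketch for reference: $\eta$ denotes the learning rate and $L$ the training loss; the Bregman divergence shown is the standard definition, whereas the paper introduces its own novel variant):

\[
W_{t+1} = W_t - \eta\,\nabla K\!\bigl(\nabla L(W_t)\bigr),
\qquad
D_K(x, y) = K(x) - K(y) - \langle \nabla K(y),\, x - y \rangle .
\]

Choosing $K(G) = \|G\|_F$ gives normalized gradient descent, since $\nabla K(G) = G / \|G\|_F$; a Huber-type $K$ applied to $\|G\|_F$ gives gradient clipping.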

๐Ÿ“ Abstract
In this work we study the convergence properties of the Dual Space Preconditioned Gradient Descent, encompassing optimizers such as Normalized Gradient Descent, Gradient Clipping and Adam. We consider preconditioners of the form $\nabla K$, where $K: \mathbb{R}^p \to \mathbb{R}$ is convex and assume that the latter is applied to train an over-parameterized linear model with loss of the form $\ell({X} {W} - {Y})$, for weights ${W} \in \mathbb{R}^{d \times k}$, labels ${Y} \in \mathbb{R}^{n \times k}$ and data ${X} \in \mathbb{R}^{n \times d}$. Under the aforementioned assumptions, we prove that the iterates of the preconditioned gradient descent always converge to a point ${W}_{\infty} \in \mathbb{R}^{d \times k}$ satisfying ${X}{W}_{\infty} = {Y}$. Our proof techniques are of independent interest as we introduce a novel version of the Bregman Divergence with accompanying identities that allow us to establish convergence. We also study the implicit bias of Dual Space Preconditioned Gradient Descent. First, we demonstrate empirically that, for general $K(\cdot)$, ${W}_\infty$ depends on the chosen learning rate, hindering a precise characterization of the implicit bias. Then, for preconditioners of the form $K({G}) = h(\|{G}\|_F)$, known as \textit{isotropic preconditioners}, we show that ${W}_\infty$ minimizes $\|{W}_\infty - {W}_0\|_F^2$ subject to ${X}{W}_\infty = {Y}$, where ${W}_0$ is the initialization. Denoting the convergence point of GD initialized at ${W}_0$ by ${W}_{\text{GD}, \infty}$, we thus note ${W}_{\infty} = {W}_{\text{GD}, \infty}$ for isotropic preconditioners. Finally, we show that a similar fact holds for general preconditioners up to a multiplicative constant, namely, $\|{W}_0 - {W}_{\infty}\|_F \le c \|{W}_0 - {W}_{\text{GD}, \infty}\|_F$ for a constant $c>0$.
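The claim for isotropic preconditioners can be checked numerically. Below is a minimal NumPy sketch (not the authors' code; toy dimensions, clipping threshold $c = 1$, and learning rate are arbitrary assumptions): gradient clipping is an isotropic preconditioner $\nabla K(G) = \min(1, c/\|G\|_F)\,G$, and its limit point should coincide with the GD limit from $W_0$, i.e. the projection $W_0 + X^{+}(Y - X W_0)$ onto the interpolating set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10, 50, 3                      # overparameterized: d > n
X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, k))
W0 = rng.standard_normal((d, k))

def clip_precond(G, c=1.0):
    """Isotropic preconditioner: grad K(G) = min(1, c/||G||_F) * G (gradient clipping)."""
    norm = np.linalg.norm(G)
    return G * min(1.0, c / norm) if norm > 0 else G

W = W0.copy()
eta = 0.01
for _ in range(50_000):
    G = X.T @ (X @ W - Y)                # gradient of 0.5 * ||XW - Y||_F^2
    W -= eta * clip_precond(G)

# Convergence to an interpolating solution: X W_inf = Y.
assert np.allclose(X @ W, Y, atol=1e-6)

# Implicit bias: W_inf matches the minimum-distance interpolator from W0,
# which is also the limit of plain GD initialized at W0.
W_gd = W0 + np.linalg.pinv(X) @ (Y - X @ W0)
assert np.allclose(W, W_gd, atol=1e-5)
print("interpolates and matches the GD limit")
```

The second assertion holds because every gradient lies in the row space of $X$, so the iterates never leave the affine set $W_0 + \mathrm{row}(X)$, which intersects $\{W : XW = Y\}$ in exactly one point.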
Problem

Research questions and friction points this paper is trying to address.

overparameterized regime
preconditioned gradient descent
implicit bias
convergence
dual space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual Space Preconditioning
Overparameterized Regime
Bregman Divergence
Implicit Bias
Isotropic Preconditioners
๐Ÿ”Ž Similar Papers
No similar papers found.