AI Summary
Understanding the implicit bias of adaptive and sign-based optimizers for linear classification on multi-class separable data remains an open problem, especially beyond SGD and binary classification.
Method: This paper provides the first rigorous characterization of the convergence behavior of Adam and sign gradient descent (SignGD) in minimizing multi-class cross-entropy loss.
Contribution/Results: We prove that both algorithms converge globally to solutions that maximize the margin with respect to the max-norm $\|W\|_{\max}$ (the largest entry-wise absolute value of the classifier matrix $W$) and derive explicit convergence rates. Our analysis unifies these dynamics under a $p$-norm normalized steepest descent framework that generalizes to broad loss families. Unlike prior work restricted to SGD or to binary settings, this is the first implicit-bias theory for adaptive and sign-based optimizers in the multi-class regime. It reveals their fundamental inductive bias toward max-norm margin maximization, offering new insight into the generalization mechanisms of modern deep learning optimizers.
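To make the object of this bias concrete, a standard way to write the max-norm margin for a linear multi-class classifier is sketched below (the notation is illustrative and assumed here; the paper's exact definitions may differ). For a classifier matrix $W$ with rows $w_1, \dots, w_K$ and separable data $\{(x_i, y_i)\}_{i=1}^n$, the margin maximized under a max-norm constraint is

$$
\gamma_{\max} \;=\; \max_{\|W\|_{\max} \le 1} \;\min_{i} \;\min_{k \neq y_i} \;(w_{y_i} - w_k)^\top x_i,
\qquad \text{where } \|W\|_{\max} = \max_{j,k} |W_{jk}|.
$$

The implicit-bias statement is then that the normalized iterates $W_t / \|W_t\|_{\max}$ approach a maximizer of this margin.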
Abstract
In the optimization of overparameterized models, different gradient-based methods can all achieve zero training error yet converge to markedly different solutions with different generalization properties. While a decade of research on implicit optimization bias has illuminated this phenomenon in various settings, even the foundational case of linear classification with separable data still has important open questions. We resolve a fundamental gap by characterizing the implicit bias of both Adam and Sign Gradient Descent in multi-class cross-entropy minimization: we prove that their iterates converge to solutions that maximize the margin with respect to the classifier matrix's max-norm, and we characterize the rate of convergence. We extend our results to general $p$-norm normalized steepest descent algorithms and to other multi-class losses.
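The setting described in the abstract can be sketched in a few lines of NumPy: SignGD (steepest descent in the $\infty$-norm) applied to the multi-class cross-entropy loss of a linear classifier on separable toy data. This is a minimal illustration under assumed hyperparameters (the data, step-size schedule, and iteration count are hypothetical, not from the paper).

```python
import numpy as np

# Toy linearly separable 3-class data in 2D (hypothetical example data).
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-2.0, 3.5], [-2.0, -3.5]])
X = np.vstack([c + 0.3 * rng.standard_normal((20, 2)) for c in centers])
y = np.repeat(np.arange(3), 20)

W = np.zeros((3, 2))  # classifier matrix, one row of weights per class

def ce_grad(W, X, y):
    """Gradient of the average multi-class cross-entropy loss w.r.t. W."""
    logits = X @ W.T
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)            # softmax probabilities
    P[np.arange(len(y)), y] -= 1.0               # softmax minus one-hot labels
    return P.T @ X / len(y)

for t in range(2000):
    # SignGD update: move each entry of W by the sign of its gradient entry,
    # with a decaying step size (an assumed schedule for this sketch).
    W -= 0.1 / np.sqrt(t + 1) * np.sign(ce_grad(W, X, y))

# On separable data the iterates should drive the training error to zero;
# the normalized iterate W / ||W||_max then tracks the max-norm margin direction.
pred = (X @ W.T).argmax(axis=1)
print("train accuracy:", (pred == y).mean())
```

Note the design choice that makes SignGD a steepest-descent method: the entry-wise sign is exactly the $\infty$-norm-constrained direction of steepest descent, which is what links it to the max-norm geometry in the result above.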