Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the implicit bias of Sharpness-Aware Minimization (SAM) in deep linear networks for binary classification, contrasting its convergence direction and training dynamics with those of standard gradient descent (GD). Using theoretical analysis and experiments in the framework of deep linear diagonal networks, linearly separable data, and gradient-flow dynamics, the study demonstrates that network depth plays a decisive role in shaping SAM's implicit bias. A key finding is a "sequential feature amplification" phenomenon in ℓ₂-SAM during finite-time training, wherein the predictor first exploits minor features and only later shifts to dominant ones. At depth L = 2 the picture splits by variant: the limit direction of ℓ∞-SAM depends heavily on initialization (it can converge to zero or to any standard basis vector), whereas ℓ₂-SAM converges to the ℓ₁ max-margin solution, as GD does, yet follows a feature-selection dynamic markedly distinct from GD's, underscoring that infinite-time analyses alone cannot characterize SAM's behavior.
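To make the setting concrete, the sketch below parameterizes a depth-2 linear diagonal network of the kind the summary refers to. It is a minimal illustration, not code from the paper: the names (`u`, `v`, `predict`, `exp_loss`) and the choice of exponential loss are assumptions, though exponential-type losses are standard in this implicit-bias setting.

```python
import numpy as np

# Depth-2 (L = 2) linear diagonal network: each layer is a diagonal
# matrix, so the effective predictor weight is the elementwise product
# w = u * v, and the model stays linear in the input: f(x) = <u * v, x>.

def predict(u, v, X):
    return X @ (u * v)

def exp_loss(u, v, X, y):
    # Exponential loss on binary labels y in {-1, +1}; on linearly
    # separable data it drives the margins y_i * f(x_i) to infinity,
    # so only the *direction* of w = u * v converges.
    return np.mean(np.exp(-y * predict(u, v, X)))
```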

📝 Abstract
We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.
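For concreteness, here is a minimal discrete-time sketch of the $\ell_2$-SAM update, showing the gradient-normalized perturbation the abstract highlights. The paper's analysis is in continuous time (gradient flow), so this step rule is only illustrative; `l2_sam_step`, `grad_fn`, `rho`, `lr`, and the small stabilizer are assumed names and choices.

```python
import numpy as np

def l2_sam_step(params, grad_fn, rho=0.05, lr=0.01):
    """One l2-SAM step: ascend along the *normalized* gradient, then
    descend using the gradient evaluated at the perturbed point."""
    g = grad_fn(params)
    # The 1/||g|| normalization is the factor the abstract credits with
    # amplifying minor coordinates early in training: the perturbation
    # always has length rho, no matter how small the gradient is.
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return params - lr * grad_fn(params + eps)
```

By contrast, the inner maximizer over an $\ell_\infty$ ball is $\varepsilon = \rho \,\mathrm{sign}(g)$ under the usual linearization, so no such normalization appears, consistent with the sharply different depth-2 behavior the abstract reports for $\ell_\infty$-SAM.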
Problem

Research questions and friction points this paper is trying to address.

Sharpness-Aware Minimization
implicit bias
linear diagonal networks
depth
feature amplification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sharpness-Aware Minimization
implicit bias
sequential feature amplification
depth-induced dynamics
gradient normalization