The Effect of Mini-Batch Noise on the Implicit Bias of Adam

📅 2026-02-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates how mini-batch noise influences model generalization through the implicit bias of the Adam optimizer, elucidating its preference mechanism between flat and sharp minima. By integrating stochastic optimization theory, implicit bias analysis, and critical batch size theory—supported by empirical validation in the overfitting regime—the study reveals that the monotonicity of the (anti-)regularization effects induced by the Adam hyperparameters β₁ and β₂ reverses as batch size varies. Notably, the default hyperparameter settings are shown to be suitable only for small-batch training. The paper further proposes that aligning β₁ more closely with β₂ during large-batch training substantially improves validation accuracy. Theoretical predictions are in strong agreement with experimental results across multiple settings.

📝 Abstract
With limited high-quality data and growing compute, multi-epoch training is gaining back its importance across sub-areas of deep learning. Adam(W), versions of which are go-to optimizers for many tasks such as next token prediction, has two momentum hyperparameters $(\beta_1, \beta_2)$ controlling memory and one very important hyperparameter, batch size, controlling (in particular) the amount of mini-batch noise. We introduce a theoretical framework to understand how mini-batch noise influences the implicit bias of memory in Adam (depending on $\beta_1$, $\beta_2$) towards sharper or flatter regions of the loss landscape, which is commonly observed to correlate with the generalization gap in multi-epoch training. We find that in the case of large batch sizes, higher $\beta_2$ increases the magnitude of anti-regularization by memory (hurting generalization), but as the batch size becomes smaller, the dependence of (anti-)regularization on $\beta_2$ is reversed. A similar monotonicity shift (in the opposite direction) happens in $\beta_1$. In particular, the common "default" pair $(\beta_1, \beta_2) = (0.9, 0.999)$ is a good choice if batches are small; for larger batches, in many settings moving $\beta_1$ closer to $\beta_2$ is much better in terms of validation accuracy in multi-epoch training. Moreover, our theoretical derivations connect the scale of the batch size at which the shift happens to the scale of the critical batch size. We illustrate this effect in experiments with small-scale data in the about-to-overfit regime.
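To make the roles of $\beta_1$ and $\beta_2$ concrete, the sketch below implements the standard (scalar) Adam update with bias correction. This is the textbook update rule, not the paper's analysis; the function name `adam_step` and the scalar setup are illustrative assumptions.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter theta given gradient g.

    beta1 and beta2 set the memory (EMA decay) of the first- and
    second-moment estimates -- the two hyperparameters whose
    interaction with mini-batch noise the paper studies.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment EMA (momentum)
    v = beta2 * v + (1 - beta2) * g * g    # second-moment EMA
    m_hat = m / (1 - beta1 ** t)           # bias correction for step t >= 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step from zero state: the update is approximately -lr * sign(g),
# regardless of the gradient's magnitude.
theta, m, v = adam_step(theta=0.0, g=2.0, m=0.0, v=0.0, t=1)
```

The paper's recommendation for large-batch training amounts to choosing `beta1` closer to `beta2` (e.g., raising `beta1` toward 0.999) rather than using the default `(0.9, 0.999)` pair, which per the abstract is better suited to small batches.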
Problem

Research questions and friction points this paper is trying to address.

implicit bias
mini-batch noise
Adam optimizer
generalization gap
multi-epoch training
Innovation

Methods, ideas, or system contributions that make the work stand out.

implicit bias
mini-batch noise
Adam optimizer
critical batch size
generalization gap