🤖 AI Summary
Sharpness-Aware Minimization (SAM) improves generalization by seeking flat minima, yet its precise mechanism—particularly in late-stage training—and its implicit bias toward flat solutions relative to SGD remain poorly understood.
Method: The authors theoretically characterize SAM's late-phase optimization dynamics as a two-stage process: rapid escape from the minimum found by SGD, followed by fast convergence to a flatter minimum within the same loss valley. They conjecture that the optimization method used in the late phase predominantly shapes the final solution's properties.
Contribution/Results: Empirically, activating SAM only in the final 3–5 training epochs achieves generalization comparable or superior to full-training SAM on CIFAR-10/100, with test-accuracy gains of 0.8%–1.2% and a ≈40% reduction in loss-landscape sharpness. The late-SAM strategy also carries over to adversarial training, where it significantly improves robustness. The work thus supplies both a theoretical account of SAM's flatness bias and a highly efficient practical protocol for exploiting it.
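The late-SAM recipe can be illustrated on a standard toy problem (my own construction, not an experiment from the paper): the loss f(u, v) = ½(uv)² has a whole valley of minima along the axes, and on the u = 0 axis the sharpness (top Hessian eigenvalue) is v². Plain gradient descent stops at whatever point in the valley it first reaches, while a few SAM steps appended afterwards drift the iterate along the valley toward flatter points. The learning rate, perturbation radius `rho`, and step counts below are illustrative choices, and the two-step update (normalized ascent step, then descent using the gradient at the perturbed point) follows the standard SAM formulation:

```python
import math

# Toy loss f(u, v) = 0.5 * (u * v)**2: every point on either axis is a
# global minimum; along the u = 0 axis the sharpness (largest Hessian
# eigenvalue) equals v**2, so "flatter" means smaller |v|.
def grad(u, v):
    return u * v * v, u * u * v

def train(lr=0.05, rho=0.3, gd_steps=100, sam_steps=300, u=0.5, v=2.0):
    # Phase 1: plain gradient descent settles somewhere in the valley.
    for _ in range(gd_steps):
        gu, gv = grad(u, v)
        u, v = u - lr * gu, v - lr * gv
    sharp_gd = v * v
    # Phase 2 (late SAM): step to the (approximately) worst point within
    # radius rho, then descend using the gradient taken there.
    for _ in range(sam_steps):
        gu, gv = grad(u, v)
        n = math.hypot(gu, gv)
        if n > 0:
            pu, pv = u + rho * gu / n, v + rho * gv / n  # ascent step
            gu, gv = grad(pu, pv)  # gradient at the perturbed point
        u, v = u - lr * gu, v - lr * gv
    return sharp_gd, v * v

sharp_gd, sharp_sam = train()
print(f"sharpness after the GD phase:   {sharp_gd:.2f}")
print(f"sharpness after late SAM steps: {sharp_sam:.2f}")
```

On this toy, the GD phase leaves the sharpness near its initial value (≈4), while the appended SAM steps shrink it well below 1: SAM's perturbed gradient acquires a component along the valley floor that plain GD lacks, which is the flatness-seeking drift the paper analyzes.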
📝 Abstract
Sharpness-Aware Minimization (SAM) has substantially improved the generalization of neural networks under various settings. Despite this success, its effectiveness remains poorly understood. In this work, we discover an intriguing phenomenon in the training dynamics of SAM, shedding light on its implicit bias towards flatter minima over Stochastic Gradient Descent (SGD). Specifically, we find that SAM efficiently selects flatter minima late in training. Remarkably, even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. Subsequently, we delve deeper into the underlying mechanism behind this phenomenon. Theoretically, we identify two phases in the learning dynamics after applying SAM late in training: i) SAM first escapes the minimum found by SGD exponentially fast; and ii) then rapidly converges to a flatter minimum within the same valley. Furthermore, we empirically investigate the role of SAM during the early training phase. We conjecture that the optimization method chosen in the late phase is more crucial in shaping the final solution's properties. Based on this viewpoint, we extend our findings from SAM to Adversarial Training.