Optimizer-Induced Mode Connectivity: From AdamW to Muon

📅 2026-05-11

🤖 AI Summary
This study investigates how the choice of optimizer shapes the connectivity of the solution space in neural networks. It combines theoretical analysis of two-layer ReLU networks with GPT-2 pretraining experiments, using spectral analysis and path-connectivity probing, to show that optimizers such as AdamW and Muon induce solution-set structure that classical mode connectivity results do not capture. Theoretically, the solutions reached by a single optimizer form a connected set at sufficiently large width, while two different optimizer-induced regions can be disjoint or overlap depending on regularization. In a small-width example, AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. In GPT-2 pretraining, same-optimizer paths preserve each model's spectrum, whereas cross-optimizer paths traverse a smooth transition, indicating that the optimizer shapes the geometry of the solution set.
📝 Abstract
Mode connectivity has been widely studied, yet the role of the optimizer remains underexplored. We revisit it through optimizer-induced implicit regularization, asking how connectivity behaves when restricted to solutions constrained by a given optimizer. For two-layer ReLU networks, we show that solutions from a single optimizer -- AdamW, Muon, or others in the Lion-$\mathcal{K}$ family -- form a connected set at sufficiently large width, a result not implied by prior work. We then characterize how optimizer-induced regions interact: at large width two different regions can be disjoint or overlap depending on regularization, while in our small-width example AdamW and Muon converge to disconnected zero-loss components separated by a provable loss barrier. Empirically, in GPT-2 pretraining, we observe that same-optimizer paths preserve each model's spectrum while cross-optimizer paths traverse a smooth transition. Our results reveal optimizer-dependent structure beyond the classical mode connectivity literature.
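To make the notion of a loss barrier between two zero-loss solutions concrete, the following is a minimal sketch and not the paper's method or experiment. It evaluates the loss along a straight line between two parameter settings of a tiny two-layer ReLU network and reports the largest excess over the endpoint losses. The dataset, the two solutions, and all function names (`two_layer_relu`, `linear_path_losses`, `loss_barrier`) are hypothetical choices made for illustration. The barrier here comes from swapping two hidden neurons, not from using different optimizers.

```python
import numpy as np

def two_layer_relu(params, X):
    # params = (W, a): hidden weights W of shape (m, d), output weights a of shape (m,)
    W, a = params
    return np.maximum(X @ W.T, 0.0) @ a

def mse(params, X, y):
    return float(np.mean((two_layer_relu(params, X) - y) ** 2))

def linear_path_losses(theta_a, theta_b, X, y, num_points=21):
    # Loss at evenly spaced points on the straight line from theta_a to theta_b.
    ts = np.linspace(0.0, 1.0, num_points)
    losses = []
    for t in ts:
        theta_t = tuple((1 - t) * pa + t * pb for pa, pb in zip(theta_a, theta_b))
        losses.append(mse(theta_t, X, y))
    return ts, np.array(losses)

def loss_barrier(ts, losses):
    # Largest excess of the path loss over the linear interpolation of endpoint losses.
    baseline = (1 - ts) * losses[0] + ts * losses[-1]
    return float(np.max(losses - baseline))

# Hypothetical toy data: the target is relu(x) on three scalar inputs.
X = np.array([[-1.0], [1.0], [2.0]])
y = np.maximum(X[:, 0], 0.0)

# Two zero-loss solutions that differ by a permutation of the hidden neurons.
theta_a = (np.array([[1.0], [0.0]]), np.array([1.0, 0.0]))
theta_b = (np.array([[0.0], [1.0]]), np.array([0.0, 1.0]))

ts, losses = linear_path_losses(theta_a, theta_b, X, y)
barrier = loss_barrier(ts, losses)
print(f"endpoint losses: {losses[0]:.3f}, {losses[-1]:.3f}; barrier: {barrier:.4f}")
```

A straight line is only one candidate path. A positive barrier on it does not show that two solutions are disconnected, since a curved path may still stay at low loss. Provable disconnection, as claimed for the paper's small-width example, requires bounding the loss on every path between the two components.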
Problem

Research questions and friction points this paper is trying to address.

mode connectivity
optimizer
implicit regularization
loss landscape
solution space
Innovation

Methods, ideas, or system contributions that make the work stand out.

optimizer-induced regularization
mode connectivity
loss landscape
implicit bias
wide neural networks