🤖 AI Summary
This work addresses a limitation of existing optimizers: they rely on fixed geometric constraints that often fail to match the intrinsic geometry of individual layers in deep neural networks. The authors propose a data-driven adaptive optimization framework that dynamically selects a proxy-optimal update geometry for each layer from gradient and activation statistics, recovering SGD, Muon, Adam, and MuAdam as special cases. The approach adapts the geometry of the linear minimization oracle (LMO) at runtime by combining Schatten-p norms, a single-step random feature regression surrogate model, and parameter-wise preconditioning, at a cost of only about 3% additional runtime over highly optimized baselines. Empirical evaluations across three training scenarios show that the method beats or remains competitive with the better of Muon and AdamW in each scenario, supporting the efficacy and scalability of adaptive geometric optimization.
📝 Abstract
Modern optimizers such as Muon impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for their update rules, chosen by design or empirically, and these are not necessarily optimal for the problem's geometry. We introduce a novel, efficient, data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim$3% runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the better-performing of Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.
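To make the "design space interpolating from SGD to Muon" concrete, the sketch below computes the LMO over a unit Schatten-$p$ ball for a single gradient matrix. It is an illustration under stated assumptions, not the paper's implementation: the function name `schatten_lmo` is hypothetical, a full SVD stands in for the faster approximations a practical optimizer would use (Muon, for example, relies on Newton-Schulz iterations), and the data-driven criterion that selects the geometry per layer is not reproduced here. For a gradient $G = U \,\mathrm{diag}(s)\, V^\top$ and conjugate exponent $q$ with $1/p + 1/q = 1$, the minimizer of $\langle G, X \rangle$ over the ball is $-U \,\mathrm{diag}(s^{q-1} / \|s\|_q^{q-1})\, V^\top$.

```python
import numpy as np


def schatten_lmo(grad, p):
    """Minimizer of <grad, X> over the unit Schatten-p ball, for 1 < p <= inf.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if s[0] == 0.0:
        # Zero gradient: every feasible point is optimal, so return zeros.
        return np.zeros_like(grad)
    # Hoelder conjugate exponent of p (q = 1 when p is infinite).
    q = 1.0 if np.isinf(p) else p / (p - 1.0)
    # Rescale by the largest singular value; the result is scale-invariant.
    s = s / s[0]
    # New singular values s^(q-1) / ||s||_q^(q-1) have unit p-norm.
    weights = s ** (q - 1.0) / np.sum(s ** q) ** ((q - 1.0) / q)
    return -(u * weights) @ vt


rng = np.random.default_rng(0)
g = rng.standard_normal((4, 3))
sgd_like = schatten_lmo(g, 2.0)      # Frobenius ball: -g / ||g||_F
muon_like = schatten_lmo(g, np.inf)  # spectral ball: -U V^T
between = schatten_lmo(g, 4.0)       # an intermediate geometry
```

At $p = 2$ the update is the normalized gradient (an SGD-style direction), and at $p = \infty$ it is the orthogonalized gradient used by Muon; intermediate $p$ values flatten the singular-value spectrum only partially.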