Adaptive Batch Sizes Using Non-Euclidean Gradient Noise Scales for Stochastic Sign and Spectral Descent

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the inefficiency of existing adaptive batch size methods, which rely on Euclidean assumptions and fail to accommodate non-Euclidean optimizers such as signSGD and specSGD that operate under norms like ℓ∞ and S∞. To overcome this limitation, the paper introduces, for the first time, a non-Euclidean gradient noise scale aligned with the dual norm geometry inherent to these optimizers. It further develops a local variance estimation algorithm that efficiently approximates this noise scale in distributed data-parallel settings, enabling geometry-aware adaptive batch size scheduling. Empirical evaluation on a 160-million-parameter Llama model demonstrates that, when combined with Signum and Muon optimizers, the proposed method reduces training steps by up to 66% while achieving the same validation loss as baseline approaches.
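The dual-norm pairing at the heart of the summary can be made concrete. The sketch below is not the paper's estimator; it is a minimal illustration assuming a McCandlish-style noise-to-signal ratio in which the Euclidean norm is replaced by the relevant dual norm: ℓ1 (the dual of signSGD's ℓ∞) for vector parameters, and the nuclear norm S1 (the dual of specSGD/Muon's S∞) for matrix parameters. The function names `dual_norm` and `noise_scale` are hypothetical.

```python
import numpy as np

def dual_norm(g, geometry):
    # l1 is the dual of l_inf (signSGD); the nuclear norm S1 is the
    # dual of the spectral norm S_inf (specSGD / Muon).
    if geometry == "sign":
        return np.abs(g).sum()
    if geometry == "spectral":
        return np.linalg.svd(g, compute_uv=False).sum()
    raise ValueError(f"unknown geometry: {geometry}")

def noise_scale(mean_grad, sample_grads, geometry):
    # Ratio of gradient noise to gradient signal, both measured in the
    # dual norm; a large value suggests a larger batch would help.
    noise = np.mean([dual_norm(g - mean_grad, geometry) ** 2
                     for g in sample_grads])
    signal = dual_norm(mean_grad, geometry) ** 2
    return noise / signal
```

A batch-size schedule would then grow the batch roughly in proportion to this ratio, rather than to the Euclidean GNS that SGD's geometry implies.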

📝 Abstract
To maximize hardware utilization, modern machine learning systems typically employ large constant or manually tuned batch size schedules, relying on heuristics that are brittle and costly to tune. Existing adaptive strategies based on gradient noise scale (GNS) offer a principled alternative. However, their assumption of SGD's Euclidean geometry creates a fundamental mismatch with popular optimizers based on generalized norms, such as signSGD / Signum ($\ell_\infty$) and stochastic spectral descent (specSGD) / Muon ($\mathcal{S}_\infty$). In this work, we derive gradient noise scales for signSGD and specSGD that naturally emerge from the geometry of their respective dual norms. To practically estimate these non-Euclidean metrics, we propose an efficient variance estimation procedure that leverages the local mini-batch gradients on different ranks in distributed data-parallel systems. Our experiments demonstrate that adaptive batch size strategies using non-Euclidean GNS enable us to match the validation loss of constant-batch baselines while reducing training steps by up to 66% for Signum and Muon on a 160 million parameter Llama model.
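The distributed estimation idea from the abstract can be sketched under simple assumptions. This is not the paper's algorithm, only an illustration of the underlying statistic: in data-parallel training each rank already holds a local mini-batch gradient, and because each local gradient averages `local_batch` i.i.d. per-sample gradients, the across-rank sample variance rescaled by `local_batch` estimates the per-sample gradient variance without any extra backward passes. The name `estimate_trace_sigma` is hypothetical.

```python
import numpy as np

def estimate_trace_sigma(rank_grads, local_batch):
    # rank_grads: one flattened gradient per data-parallel rank, each
    # the average of `local_batch` per-sample gradients. The variance of
    # an average of B i.i.d. samples is Sigma / B, so the unbiased
    # sample variance across ranks times local_batch estimates
    # tr(Sigma), the total per-sample gradient variance.
    rank_grads = np.asarray(rank_grads, dtype=float)
    mean = rank_grads.mean(axis=0)
    n_ranks = rank_grads.shape[0]
    across_rank = ((rank_grads - mean) ** 2).sum() / (n_ranks - 1)
    return local_batch * across_rank
```

In a real run the per-rank gradients (or just the reduced statistics) would need to be collected before the gradient all-reduce overwrites the local copies.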
Problem

Research questions and friction points this paper is trying to address.

adaptive batch size
gradient noise scale
non-Euclidean geometry
signSGD
spectral descent
Innovation

Methods, ideas, or system contributions that make the work stand out.

non-Euclidean gradient noise scale
adaptive batch size
signSGD
spectral descent
distributed variance estimation
Hiroki Naganuma
Mila, Montreal, Canada; Université de Montréal, Montreal, Canada
Shagun Gupta
Meta Platforms, Menlo Park, California, USA
Youssef Briki
Université de Montréal, Montreal, Canada
Ioannis Mitliagkas
Mila, Montreal, Canada; Université de Montréal, Montreal, Canada
Irina Rish
University of Montreal / Mila – Quebec AI Institute
Artificial Intelligence, Machine Learning, Neuroscience
Parameswaran Raman
Research Scientist, Meta
Machine Learning, Optimization Algorithms, Large Language Models, Ranking, Distributed Training
Hao-Jun Michael Shi
Research Scientist, Meta
Numerical Optimization, Mathematical Software, Deep Learning, Scientific Computing