🤖 AI Summary
Conventional deep learning optimizers update all network layers at each iteration, yet this uniform update strategy may not be optimal. Method: We propose Drop-Muon, a non-Euclidean randomized progressive training framework that applies layer-specific updates under a randomized schedule, updating only a subset of layers per iteration. Contribution/Results: Theoretically, we establish the first convergence guarantees for progressive training in the stochastic and non-smooth regime, and a cost analysis shows that full-network updates are optimal only when a very specific relationship between the layer smoothness constants holds. Technically, Drop-Muon combines non-Euclidean optimization with layer-wise smoothness analysis and covers both deterministic and stochastic gradient settings. Empirically, controlled CNN experiments show that Drop-Muon reaches the same accuracy as full-network Muon up to 1.4× faster in wall-clock time.
📝 Abstract
Conventional wisdom in deep learning optimization dictates updating all layers at every step, a principle followed by all recent state-of-the-art optimizers such as Muon. In this work, we challenge this assumption, showing that full-network updates can be fundamentally suboptimal, both in theory and in practice. We introduce Drop-Muon, a non-Euclidean Randomized Progressive Training method: a simple yet powerful framework that updates only a subset of layers per step according to a randomized schedule, combining the efficiency of progressive training with layer-specific non-Euclidean updates for top-tier performance. We provide rigorous convergence guarantees under both layer-wise smoothness and layer-wise $(L^0, L^1)$-smoothness, covering deterministic and stochastic gradient settings; these are the first such results for progressive training in the stochastic and non-smooth regime. Our cost analysis further reveals that full-network updates are not optimal unless a very specific relationship between layer smoothness constants holds. Through controlled CNN experiments, we empirically demonstrate that Drop-Muon consistently outperforms full-network Muon, achieving the same accuracy up to $1.4\times$ faster in wall-clock time. Together, our results suggest a shift in how large-scale models can be efficiently trained, challenging the status quo and offering a highly efficient, theoretically grounded alternative to full-network updates.
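The update rule described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the randomized schedule is modeled as sampling the index of the first layer to update and then updating that layer and every later one (one common progressive-training pattern); the layer-specific non-Euclidean step is modeled as a Muon-style orthogonalized momentum update; and the orthogonalization uses an exact SVD rather than the Newton-Schulz iteration that Muon implementations typically use. The function names `orthogonalize`, `sample_first_active_layer`, and `drop_step` and the default hyperparameters are hypothetical.

```python
import numpy as np


def orthogonalize(matrix):
    """Return U @ Vt from the thin SVD (all singular values set to 1)."""
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def sample_first_active_layer(num_layers, rng, probs=None):
    """Hypothetical schedule: draw the index of the first layer to update.

    probs=None means a uniform draw over the layer indices.
    """
    return int(rng.choice(num_layers, p=probs))


def drop_step(weights, grads, momenta, first_active, lr=0.02, beta=0.9):
    """Update layers first_active..end in place; earlier layers stay frozen.

    Assumption: each layer is a 2-D weight matrix with a matching gradient.
    """
    for i in range(first_active, len(weights)):
        momenta[i] = beta * momenta[i] + (1.0 - beta) * grads[i]
        weights[i] -= lr * orthogonalize(momenta[i])
    return weights, momenta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)) for _ in range(3)]
    momenta = [np.zeros_like(w) for w in weights]
    # Stand-in gradients; a real run would backpropagate only through
    # the active layers, which is where the compute saving comes from.
    grads = [rng.standard_normal(w.shape) for w in weights]
    first = sample_first_active_layer(len(weights), rng)
    drop_step(weights, grads, momenta, first)
    print("updated layers:", list(range(first, len(weights))))
```

In this sketch the saving would come from skipping the backward pass and the orthogonalization for the frozen layers; the sampling probabilities `probs` are the knob that a cost analysis like the one mentioned above would tune.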