🤖 AI Summary
This work addresses the challenges of GPU memory constraints in full-parameter training of large language models and the slow convergence of existing memory-efficient methods in non-convex optimization. To this end, the authors propose OMGD, a lightweight optimizer that reduces memory consumption through a mask-traversal mechanism. Notably, OMGD achieves a non-convex iteration complexity of Õ(ε⁻³) within a memory-efficient training framework, the first such method to improve on the prevailing O(ε⁻⁴) bound. Designed to be plug-and-play, OMGD integrates seamlessly with mainstream optimizers and uses gradient-masking schedules grounded in non-convex optimization theory. Experimental results demonstrate consistent gains over baseline methods in both pretraining and fine-tuning tasks, effectively balancing training efficiency and convergence performance.
📝 Abstract
Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or only achieve the standard ${\mathcal{O}}(\epsilon^{-4})$ iteration complexity in the nonconvex setting. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
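To make the mask-traversal idea concrete, here is a minimal sketch of masked gradient descent in which each step updates only one block of coordinates and the blocks are traversed cyclically, so optimizer state need only cover the active block. The block-cyclic schedule, block count, and function names are illustrative assumptions; the paper's actual mask schedule and the machinery behind the $\tilde{\mathcal{O}}(\epsilon^{-3})$ rate are not reproduced here, and a real implementation would compute the gradient only for the masked block rather than the full gradient shown below.

```python
import numpy as np

def masked_gd_sketch(grad_fn, theta, lr=0.1, num_blocks=4, steps=100):
    """Illustrative masked gradient descent with a hypothetical
    block-cyclic mask traversal (not the paper's exact schedule).
    Each step updates only the coordinates in the active block."""
    d = theta.size
    blocks = np.array_split(np.arange(d), num_blocks)
    for t in range(steps):
        active = blocks[t % num_blocks]     # traverse masks cyclically
        g = grad_fn(theta)                  # full gradient, for simplicity only
        theta[active] -= lr * g[active]     # update only masked coordinates
    return theta

# Usage: minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
theta = masked_gd_sketch(lambda x: x, np.ones(8), lr=0.5, steps=40)
```

With 4 blocks over 40 steps, each coordinate is updated 10 times and shrinks by a factor of 0.5 per update, so all coordinates converge toward the stationary point at the origin.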