🤖 AI Summary
This work addresses the challenges of GPU memory constraints in full-parameter training of large language models and the slow convergence of existing memory-efficient methods in non-convex optimization. To this end, the authors propose OMGD, a lightweight optimizer that reduces memory consumption through a mask-traversal mechanism. Notably, OMGD achieves a non-convex iteration complexity of Õ(ε⁻³) within a memory-efficient training framework, the first such method to improve on the prevailing O(ε⁻⁴) bound. Designed to be plug-and-play, OMGD integrates seamlessly with mainstream optimizers and uses gradient-masking schedules grounded in non-convex optimization theory. Experimental results demonstrate consistent gains over baseline methods in both pretraining and fine-tuning tasks, effectively balancing training efficiency and convergence performance.
📝 Abstract
Memory-efficient optimization methods have recently gained increasing attention for scaling full-parameter training of large language models under the GPU-memory bottleneck. Existing approaches either lack clear convergence guarantees or only achieve the standard ${\mathcal{O}}(\epsilon^{-4})$ iteration complexity in the nonconvex setting. We propose Omni-Masked Gradient Descent (OMGD), an optimization method based on mask traversal for memory-efficient training, and provide a nonconvex convergence analysis that establishes a strictly improved iteration complexity of $\tilde{\mathcal{O}}(\epsilon^{-3})$ for finding an $\epsilon$-approximate stationary point. Empirically, OMGD is a lightweight, plug-and-play approach that integrates seamlessly into most mainstream optimizers, yielding consistent improvements over competitive baselines in both fine-tuning and pre-training tasks.
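To make the mask-traversal idea concrete, here is a minimal sketch of masked gradient descent in which each step updates only one block of coordinates and the blocks are traversed cyclically, so optimizer state need only cover the active block. The block-cyclic schedule, block count, and function names are illustrative assumptions; the paper's actual mask schedule and the machinery behind the $\tilde{\mathcal{O}}(\epsilon^{-3})$ rate are not reproduced here, and a real implementation would compute the gradient only for the masked block rather than the full gradient shown below.

```python
import numpy as np

def masked_gd_sketch(grad_fn, theta, lr=0.1, num_blocks=4, steps=100):
    """Illustrative masked gradient descent with a hypothetical
    block-cyclic mask traversal (not the paper's exact schedule).
    Each step updates only the coordinates in the active block."""
    d = theta.size
    blocks = np.array_split(np.arange(d), num_blocks)
    for t in range(steps):
        active = blocks[t % num_blocks]     # traverse masks cyclically
        g = grad_fn(theta)                  # full gradient, for simplicity only
        theta[active] -= lr * g[active]     # update only masked coordinates
    return theta

# Usage: minimize f(x) = ||x||^2 / 2, whose gradient is x itself.
theta = masked_gd_sketch(lambda x: x, np.ones(8), lr=0.5, steps=40)
```

With 4 blocks over 40 steps, each coordinate is updated 10 times and shrinks by a factor of 0.5 per update, so all coordinates converge toward the stationary point at the origin.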