🤖 AI Summary
To address the prohibitive memory overhead of full-parameter fine-tuning for large language models (LLMs), this paper proposes Momentum Low-rank Compression (MLorc), a novel training paradigm that, for the first time, applies low-rank compression directly to optimizer momentum rather than to gradients or weight updates. MLorc introduces a dynamic rank selection mechanism to preserve the update dynamics of full fine-tuning and provides theoretical convergence guarantees under standard assumptions. It is compatible with generic optimizers such as SGD and Adam. Experiments demonstrate that, at rank $r = 4$, MLorc matches or surpasses full-parameter fine-tuning across multiple LLMs and downstream tasks, while achieving memory and computational efficiency comparable to LoRA and GaLore. Its strong generalization across architectures and tasks underscores its robustness. The core innovation is adaptive low-rank compression applied explicitly to momentum, which effectively balances training efficiency and optimization fidelity.
📝 Abstract
With the increasing size of large language models (LLMs), full-parameter fine-tuning imposes substantial memory demands. To alleviate this, we propose a novel memory-efficient training paradigm called Momentum Low-rank Compression (MLorc). By directly compressing and reconstructing momentum rather than gradients, MLorc avoids imposing a fixed-rank constraint on weight update matrices and better preserves the training dynamics of full-parameter fine-tuning, in contrast to existing low-rank approaches such as LoRA and GaLore. Empirically, MLorc consistently outperforms other memory-efficient training methods, matches or even exceeds the performance of full fine-tuning with a small rank (e.g., $r=4$), and generalizes well across different optimizers -- all while not compromising time or memory efficiency. Furthermore, we provide a theoretical guarantee for its convergence under reasonable assumptions.
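To make the compress-and-reconstruct idea concrete, here is a minimal sketch of one momentum-SGD step in which the stored momentum is kept as rank-$r$ SVD factors and reconstructed only to apply the update. This is an illustration under assumed simplifications, not the paper's actual algorithm: it uses plain truncated SVD with a fixed rank (the paper's dynamic rank selection and Adam variant are omitted), and all function names are hypothetical.

```python
import numpy as np

def compress(m, r):
    # Truncated SVD: keep only the top-r singular triplets of the momentum
    # matrix, so storage is O((p + q) * r) instead of O(p * q).
    U, s, Vt = np.linalg.svd(m, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

def reconstruct(U, s, Vt):
    # Rebuild the (rank-r) momentum matrix from its stored factors.
    return (U * s) @ Vt

def mlorc_sgd_step(W, grad, factors, lr=0.05, beta=0.9, r=4):
    """One SGD-with-momentum step where only low-rank momentum factors
    persist between iterations (illustrative sketch, not the paper's code).

    Note: the gradient itself is never rank-constrained; only the running
    momentum is compressed, which is the key distinction from projecting
    gradients (GaLore) or parameterizing updates (LoRA).
    """
    m = reconstruct(*factors) if factors is not None else np.zeros_like(W)
    m = beta * m + grad            # full-rank momentum accumulation
    factors = compress(m, r)       # re-compress before storing
    W = W - lr * reconstruct(*factors)
    return W, factors
```

As a toy usage, running this step on the quadratic objective $f(W) = \|W\|_F^2$ (gradient $2W$) steadily shrinks the iterate even though the stored momentum never exceeds rank 4.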