LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

📅 2024-10-21
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
To address the high optimizer memory overhead of large language model (LLM) training and the challenge of balancing full-parameter-space exploration with efficient low-dimensional updates, this paper proposes LDAdam. Methodologically, LDAdam introduces: (1) a novel projection-aware state update rule that jointly models compression errors in gradients and optimizer states; (2) a generalized error-feedback mechanism coupled with dynamic subspace switching to ensure persistent exploration of the full parameter space; and (3) subspace-adaptive estimation of momentum and second-order statistics via low-rank gradient projections. Theoretically, the authors establish convergence guarantees under standard assumptions. Empirically, LDAdam matches AdamW's accuracy on both LLM pretraining and fine-tuning tasks, while reducing optimizer memory consumption to less than 1% of the original parameter count.
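The core memory saving comes from keeping the Adam moments in an r-dimensional subspace rather than the full d-dimensional parameter space. The following is a minimal sketch of such a subspace-adaptive step, not the authors' implementation: the function name, the fixed projection `P`, and the omission of bias correction and subspace switching are all simplifying assumptions here.

```python
import numpy as np

def subspace_adam_step(W, G, m, v, P, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Hypothetical subspace-adaptive Adam step.

    W: (d, n) parameter matrix, G: (d, n) gradient,
    m, v: (r, n) Adam moments kept ONLY in the low-rank subspace,
    P: (d, r) matrix with orthonormal columns spanning the subspace.
    Bias correction and subspace switching are omitted for brevity.
    """
    g_low = P.T @ G                       # compress gradient to r dims
    m = b1 * m + (1 - b1) * g_low         # first moment, in the subspace
    v = b2 * v + (1 - b2) * g_low ** 2    # second moment, in the subspace
    step = m / (np.sqrt(v) + eps)         # adaptive update in the subspace
    W = W - lr * (P @ step)               # map update back to full space
    return W, m, v
```

Because `m` and `v` are (r, n) rather than (d, n), optimizer state shrinks by roughly a factor of r/d, which is the source of the memory reduction described above.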

📝 Abstract
We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer's memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and show that LDAdam allows for accurate and efficient fine-tuning and pre-training of language models. Code is available at https://github.com/IST-DASLab/LDAdam
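The abstract's two distinguishing mechanisms, the projection-aware state update for transitioning between subspaces and the generalized error feedback, can be illustrated with a short sketch. This is an assumption-laden reading of the abstract, not the paper's actual update rule: the overlap-matrix state transfer and the residual accumulator below are illustrative simplifications.

```python
import numpy as np

def switch_subspace(m, v, P_old, P_new):
    """Hypothetical projection-aware state transition: re-express the
    subspace moments when moving from span(P_old) to span(P_new).
    R is the (r, r) change-of-basis between the two subspaces."""
    R = P_new.T @ P_old
    m_new = R @ m              # carry first moment into the new subspace
    v_new = (R ** 2) @ v       # crude second-moment transfer (illustrative only)
    return m_new, v_new

def compress_with_feedback(G, P, err):
    """Sketch of generalized error feedback: reinject the accumulated
    compression error into the incoming gradient before projecting,
    then store the part of the corrected gradient lost to compression."""
    G_corr = G + err           # add back previously discarded information
    g_low = P.T @ G_corr       # low-rank compression onto span(P)
    err = G_corr - P @ g_low   # residual orthogonal to span(P)
    return g_low, err
```

The error buffer guarantees that gradient components outside the current subspace are not silently dropped; combined with periodic subspace switching, every direction of the full parameter space is eventually explored, which is the behavior the abstract describes.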
Problem

Research questions and friction points this paper is trying to address.

High optimizer memory overhead when training large models
Balancing efficient low-dimensional updates with exploration of the full parameter space
Preserving accuracy in fine-tuning and pre-training of language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Projection-aware update rule for optimizer states across subspace transitions
Adaptive optimization within low-dimensional subspaces with dynamic subspace switching
Generalized error feedback accounting for both gradient and optimizer state compression