🤖 AI Summary
This work addresses the excessive memory overhead of Adam caused by its per-parameter learning rates. We propose Adam-mini, a lightweight optimizer that significantly reduces memory consumption while preserving optimization performance. Our key insight, derived from an analysis of the Hessian structure of neural networks, is that over 99.9% of the elements in Adam's second-moment estimator *v* can be removed without degrading accuracy. Leveraging this finding, we introduce a Hessian-driven parameter-blocking strategy that assigns a single scalar learning rate per block, enabling adaptive optimization with minimal optimizer state. Extensive experiments across the full training pipeline (pretraining, supervised fine-tuning, and RLHF) on language models ranging from 39M to 13B parameters demonstrate that Adam-mini matches or exceeds AdamW in convergence and final model quality, reduces optimizer memory footprint by 50%, improves training throughput by 49.6% for Llama 2-7B pretraining, and achieves a 33% reduction in wall-clock training time.
📝 Abstract
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with a 50% smaller memory footprint. Adam-mini reduces memory by cutting down the learning-rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find that Adam's $v$ might not function at its full potential. Specifically, $\geq$ 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; and (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, which saves 33% of the wall-clock time for pre-training.
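To make the core idea concrete, the sketch below contrasts Adam's per-element second moment with a per-block scalar. This is a minimal, hypothetical illustration: each parameter tensor is treated as one block and its scalar $v$ tracks the mean squared gradient, whereas the paper's actual partition follows the Hessian structure (e.g., per attention head) and is not reproduced here. The function name and state layout are our own, not the paper's implementation.

```python
import numpy as np

def adam_mini_step(params, grads, state, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of a simplified Adam-mini-style update.

    `params` / `grads`: dicts of name -> ndarray.
    `state`: {"t": int, "m": {name: ndarray}, "v": {name: float}}.
    Note that v is a SINGLE SCALAR per block, not per element,
    which is where the ~50% optimizer-memory saving comes from.
    """
    state["t"] += 1
    t = state["t"]
    for name, p in params.items():
        g = grads[name]
        # First moment is kept per-element, exactly as in Adam.
        m = state["m"][name] = beta1 * state["m"][name] + (1 - beta1) * g
        # Second moment: EMA of the block-wise MEAN squared gradient,
        # replacing Adam's per-element v with one scalar per block.
        v = state["v"][name] = (beta2 * state["v"][name]
                                + (1 - beta2) * float(np.mean(g * g)))
        m_hat = m / (1 - beta1 ** t)   # standard bias correction
        v_hat = v / (1 - beta2 ** t)
        p -= lr * m_hat / (np.sqrt(v_hat) + eps)  # shared 1/sqrt(v) per block
```

Because every element of a block shares the same $1/\sqrt{v}$, the block effectively receives a single adaptive learning rate, which is exactly the design choice the abstract describes.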