Adam-mini: Use Fewer Learning Rates To Gain More

📅 2024-06-24
🏛️ arXiv.org
📈 Citations: 21
Influential: 3
🤖 AI Summary
This work addresses the excessive memory overhead of Adam caused by per-parameter learning rates. We propose Adam-mini, a lightweight optimizer that significantly reduces memory consumption while preserving optimization performance. Our key insight, derived from analysis of the Hessian structure of neural networks, is that at least 99.9% of the elements in Adam's second-moment estimator *v* can be removed without accuracy degradation. Leveraging this finding, we introduce a Hessian-driven parameter blocking strategy that assigns a single scalar learning rate per block, enabling adaptive optimization with minimal state and no loss in model quality. Extensive experiments across the full training pipeline (pretraining, supervised fine-tuning, and RLHF) on language models ranging from 39M to 13B parameters demonstrate that Adam-mini matches or exceeds AdamW in convergence and final model quality, reduces optimizer memory footprint by 50%, improves training throughput by 49.6% for Llama 2-7B pretraining, and achieves a 33% reduction in wall-clock training time.

📝 Abstract
We propose Adam-mini, an optimizer that achieves on par or better performance than AdamW with 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). By investigating the Hessian structure of neural nets, we find Adam's $v$ might not function at its full potential as effectively as we expected. We find that $\geq$ 99.9% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our new principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
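To make the block-wise idea concrete, here is a minimal sketch of an Adam-mini-style update step: it keeps a per-element first moment as in Adam, but replaces the per-element second moment $v$ with a single scalar per parameter block (here, one block per tensor, using the mean squared gradient). The function name, data layout, and block granularity are illustrative assumptions, not the paper's reference implementation; the paper additionally prescribes a specific Hessian-informed partition of parameters into blocks.

```python
import math

def adam_mini_step(params, grads, m, v_block, t, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-mini-style update (illustrative sketch).

    params, grads, m: lists of blocks; each block is a list of floats.
    v_block: one second-moment scalar per block (instead of per element).
    t: 1-based step count, used for bias correction.
    """
    for b, (p, g) in enumerate(zip(params, grads)):
        # Per-element first moment, exactly as in Adam.
        m[b] = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m[b], g)]
        # Single second-moment scalar per block: EMA of the mean squared gradient.
        mean_g2 = sum(gi * gi for gi in g) / len(g)
        v_block[b] = beta2 * v_block[b] + (1 - beta2) * mean_g2
        # Bias corrections, as in Adam.
        m_hat = [mi / (1 - beta1 ** t) for mi in m[b]]
        v_hat = v_block[b] / (1 - beta2 ** t)
        # One shared adaptive step size for the whole block.
        step = lr / (math.sqrt(v_hat) + eps)
        params[b] = [pi - step * mh for pi, mh in zip(p, m_hat)]
    return params, m, v_block
```

Because $v$ stores one scalar per block rather than one per parameter, the second-moment state shrinks from O(number of parameters) to O(number of blocks), which is the source of the reported ~50% reduction in optimizer memory (Adam's state is dominated by $m$ and $v$).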
Problem

Research questions and friction points this paper is trying to address.

Optimizer memory overhead
Reducing the number of per-parameter learning rates
Preserving training performance with less optimizer state
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cuts the learning-rate (second-moment) state in Adam
Partitions parameters into blocks guided by Hessian structure
Assigns a single learning rate to each parameter block