When Can You Get Away with Low Memory Adam?

📅 2025-03-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Adam's memory overhead stems from storing per-parameter second-moment estimates (exponentially weighted averages of squared gradients). Method: This work applies a layer-wise signal-to-noise ratio (SNR) analysis to determine when second-moment tensors can be replaced by their means along particular dimensions, revealing how architecture, training hyperparameters, and dataset properties affect compressibility along Adam's trajectory. The analysis is instantiated as SlimAdam, a low-memory Adam variant that compresses second moments along high-SNR dimensions and leaves them uncompressed where compression would be detrimental. Contribution/Results: Across a diverse set of architectures and training scenarios, SlimAdam matches Adam's performance and stability while saving up to 98% of total second moments. The implementation is open-sourced.

📝 Abstract
Adam is the go-to optimizer for training modern machine learning models, but it requires additional memory to maintain the moving averages of the gradients and their squares. While various low-memory optimizers have been proposed that sometimes match the performance of Adam, their lack of reliability has left Adam as the default choice. In this work, we apply a simple layer-wise Signal-to-Noise Ratio (SNR) analysis to quantify when second-moment tensors can be effectively replaced by their means across different dimensions. Our SNR analysis reveals how architecture, training hyperparameters, and dataset properties impact compressibility along Adam's trajectory, naturally leading to $\textit{SlimAdam}$, a memory-efficient Adam variant. $\textit{SlimAdam}$ compresses the second moments along dimensions with high SNR when feasible, and leaves them uncompressed when compression would be detrimental. Through experiments across a diverse set of architectures and training scenarios, we show that $\textit{SlimAdam}$ matches Adam's performance and stability while saving up to $98\%$ of total second moments. Code for $\textit{SlimAdam}$ is available at https://github.com/dayal-kalra/low-memory-adam.
Problem

Research questions and friction points this paper is trying to address.

Reducing Adam's high memory usage for gradient moment storage.
Determining when second-moment tensors can be dimensionally compressed.
Maintaining Adam's performance while saving up to 98% of second-moment memory.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise SNR analysis for second-moment tensor replacement
Dimension-selective compression based on SNR thresholds
Dynamic memory optimization preserving Adam's stability
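The core idea above — measure the SNR of a layer's second-moment tensor along a dimension, and replace the tensor by its mean along that dimension only when the SNR is high — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the SNR definition, the aggregation across slices, and the threshold value are assumptions chosen for clarity.

```python
import numpy as np

def snr_along(v, axis):
    """Illustrative layer-wise SNR of a second-moment tensor along one axis:
    squared mean over variance, averaged across slices. A high value means
    entries vary little along that axis, so the mean is a good summary."""
    mean = v.mean(axis=axis)
    var = v.var(axis=axis) + 1e-12  # guard against division by zero
    return float((mean**2 / var).mean())

def maybe_compress(v, axis, threshold=10.0):
    """Replace v by its mean along `axis` when the SNR exceeds a threshold
    (value is hypothetical); otherwise leave the tensor uncompressed."""
    if snr_along(v, axis) > threshold:
        # Only the reduced mean tensor would need to be stored; the
        # broadcast view here just makes the compressed result explicit.
        return np.broadcast_to(v.mean(axis=axis, keepdims=True), v.shape)
    return v

# A second-moment tensor that is nearly constant along axis 0 compresses:
rng = np.random.default_rng(0)
v = np.ones((256, 64)) * 0.5 + rng.normal(0, 1e-4, size=(256, 64))
compressed = maybe_compress(v, axis=0)
```

In the compressible case, only the 64 per-column means need to be kept instead of all 256×64 entries, a ~99.6% reduction for this layer; a tensor whose entries vary strongly along the chosen axis falls below the threshold and is stored in full, which mirrors the selective behavior described above.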
Authors
Dayal Singh Kalra — Department of Computer Science, University of Maryland, College Park
John Kirchenbauer — University of Maryland, College Park
M. Barkeshli — Department of Physics and Joint Quantum Institute, University of Maryland, College Park
Tom Goldstein — Volpi-Cupal Professor of Computer Science, University of Maryland