AlphaAdam: Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates

📅 2025-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address inefficient and unstable parameter updates in large language model training, this paper proposes AlphaAdam, an asynchronous masked optimization framework. AlphaAdam introduces a fine-grained, intra-layer asynchronous parameter-masking mechanism that leverages the consistency between historical momentum and the current gradient direction to enable selective, asynchronous updates within each layer. It further incorporates a dynamic α-strength control strategy that adaptively modulates update step sizes without increasing GPU memory overhead. Designed as a plug-in module, AlphaAdam is compatible with mainstream momentum-based optimizers such as SGD with momentum and AdamW, and provides theoretical convergence guarantees under standard assumptions. Extensive experiments on GPT-2, RoBERTa, and Llama-7B across multiple downstream tasks demonstrate that AlphaAdam accelerates convergence by up to 32% over AdamW, improves computational efficiency, and maintains stable performance.

📝 Abstract
In the training of large language models (LLMs), updating parameters efficiently and stably has long been an important challenge. Existing methods typically achieve performance comparable to full parameter updates through techniques such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLMs from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask-strength strategy, which ensures efficient optimization with theoretical convergence guarantees and is applicable to most momentum-based optimizers. Extensive experiments show that AlphaAdam outperforms state-of-the-art methods such as AdamW in convergence speed and computational efficiency across tasks, including GPT-2 pre-training and the fine-tuning of RoBERTa and Llama-7B. AlphaAdam thus provides an optimizer-enhancement framework for LLMs through intra-layer asynchronous masked adaptive updates. Our code is available at https://github.com/MaeChd/AlphaAdam.
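To make the abstract's mechanism concrete, the following is a minimal sketch of the core idea only: an Adam-style step where a mask keeps coordinates whose momentum and current gradient agree in sign, and a scalar `alpha` dampens the rest. The function name, the fixed `alpha`, and the simple sign-agreement mask are illustrative assumptions; the paper's actual mask construction and adaptive α-strength schedule are more elaborate.

```python
import numpy as np

def alpha_masked_adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                           eps=1e-8, alpha=0.5):
    """One illustrative masked-Adam step (sketch, not the authors' code).

    Coordinates where historical momentum and the current gradient point
    the same way get a full update; the rest are scaled down by `alpha`.
    """
    # Standard Adam moment updates (in place).
    m[:] = beta1 * m + (1 - beta1) * g
    v[:] = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Direction-consistency mask: momentum agrees with the fresh gradient.
    mask = np.sign(m) == np.sign(g)
    # Dampen inconsistent coordinates instead of updating them fully.
    scale = np.where(mask, 1.0, alpha)

    p -= lr * scale * m_hat / (np.sqrt(v_hat) + eps)
    return p
```

A fully adaptive variant would replace the fixed `alpha` with a schedule driven by training statistics, which is where the "dynamic α-strength control" of the paper comes in.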
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Training
Optimization Methods
Parameter Update Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

AlphaAdam
Hierarchical Parameter Update
Adaptive Strategy
Da Chang
Peng Cheng Laboratory, Shenzhen, China; Shenzhen Institutes of Advanced Technology, Shenzhen, China
Yu Li
Wuhan University, Hongyi College, Wuhan, China
Ganzhao Yuan
Shenzhen University of Advanced Technology (SUAT), China
Nonlinear Optimization · Machine Learning