AlphaAdam: Asynchronous Masked Optimization with Dynamic Alpha for Selective Updates

📅 2025-01-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address inefficient and unstable parameter updates in large language model training, this paper proposes AlphaAdam, an asynchronous masked optimization framework. AlphaAdam introduces a fine-grained, intra-layer asynchronous parameter-masking mechanism that leverages the consistency between historical momentum and the current gradient direction to enable selective, asynchronous updates within each layer. It further incorporates a dynamic α-strength control strategy that adaptively modulates update step sizes without increasing GPU memory overhead. Designed as a plug-in module, AlphaAdam is compatible with mainstream momentum-based optimizers such as SGD with momentum and AdamW, and provides theoretical convergence guarantees under standard assumptions. Extensive experiments on GPT-2, RoBERTa, and Llama-7B across multiple downstream tasks demonstrate that AlphaAdam accelerates convergence by up to 32% over AdamW, improves computational efficiency, and maintains stable performance.

📝 Abstract
In the training of large language models (LLMs), updating parameters efficiently and stably has long been an important challenge. Existing methods typically achieve performance comparable to full parameter updates through techniques such as low-dimensional decomposition or layer-wise selective updates. In this work, we propose AlphaAdam, an optimization framework for LLMs from the perspective of intra-layer parameter updates. By decoupling parameter updates and dynamically adjusting their strength, AlphaAdam accelerates convergence and improves training stability. We construct parameter masks based on the consistency of historical momentum and gradient direction and combine them with an adaptive mask-strength strategy, which ensures efficient optimization with theoretical convergence guarantees and is applicable to most momentum-based optimizers. Extensive experiments show that AlphaAdam outperforms state-of-the-art methods such as AdamW in convergence speed and computational efficiency across tasks, including GPT-2 pre-training and the fine-tuning of RoBERTa and Llama-7B. AlphaAdam thus provides an optimizer-enhancement framework for LLMs through intra-layer asynchronous masked adaptive updates. Our code is available at https://github.com/MaeChd/AlphaAdam.
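To make the abstract's mechanism concrete, the following is a minimal sketch of the core idea only: an Adam-style step where a mask keeps coordinates whose momentum and current gradient agree in sign, and a scalar `alpha` dampens the rest. The function name, the fixed `alpha`, and the simple sign-agreement mask are illustrative assumptions; the paper's actual mask construction and adaptive α-strength schedule are more elaborate.

```python
import numpy as np

def alpha_masked_adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                           eps=1e-8, alpha=0.5):
    """One illustrative masked-Adam step (sketch, not the authors' code).

    Coordinates where historical momentum and the current gradient point
    the same way get a full update; the rest are scaled down by `alpha`.
    """
    # Standard Adam moment updates (in place).
    m[:] = beta1 * m + (1 - beta1) * g
    v[:] = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Direction-consistency mask: momentum agrees with the fresh gradient.
    mask = np.sign(m) == np.sign(g)
    # Dampen inconsistent coordinates instead of updating them fully.
    scale = np.where(mask, 1.0, alpha)

    p -= lr * scale * m_hat / (np.sqrt(v_hat) + eps)
    return p
```

A fully adaptive variant would replace the fixed `alpha` with a schedule driven by training statistics, which is where the "dynamic α-strength control" of the paper comes in.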
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Training
Optimization Methods
Parameter Update Stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

AlphaAdam
Hierarchical Parameter Update
Adaptive Strategy
Da Chang
Peng Cheng Laboratory, Shenzhen, China; Shenzhen Institutes of Advanced Technology, Shenzhen, China
Yu Li
Wuhan University, Hongyi College, Wuhan, China
Ganzhao Yuan
Shenzhen University of Advanced Technology (SUAT), China
Nonlinear Optimization · Machine Learning