🤖 AI Summary
This work addresses a challenge in training large language models (LLMs): their heterogeneous modules exhibit uneven gradient noise, and conventional adaptive optimizers such as Adam(W) do not account for these module-level differences, which can lead to slow convergence, suboptimal performance, or training instability. The authors analyze Adam's noise-damping behavior in high-noise modules and propose Module-wise Learning Rate Scaling via SNR (MoLS), which estimates a gradient signal-to-noise ratio (SNR) for each module and uses it to scale that module's Adam updates, allocating module-wise learning rates automatically rather than by manual tuning. The method remains compatible with memory-efficient training algorithms. Experiments across multiple LLM training benchmarks show that MoLS improves convergence speed and generalization, achieving performance comparable to carefully tuned module-specific learning rates.
📝 Abstract
The impressive performance of large language models (LLMs) arises from their massive scale and heterogeneous module composition. However, this structural heterogeneity introduces additional optimization challenges. While adaptive optimizers such as Adam(W) provide per-parameter adaptivity, they do not explicitly account for module-level gradient heterogeneity, resulting in slower convergence, suboptimal performance, or training instability. Existing approaches typically rely on manually tuned module-specific learning rates or specialized optimization strategies, which are computationally costly and difficult to generalize across tasks or models. To establish a more principled approach, we first analyze the noise-damping behavior of Adam in high-noise modules and introduce Module-wise Learning Rate Scaling via SNR (MoLS). MoLS estimates module-level signal-to-noise ratios (SNRs) to scale Adam updates, allowing automated module-wise learning rate allocation without manual tuning. Empirical results across multiple LLM training benchmarks demonstrate that MoLS improves convergence speed and generalization, achieving performance comparable to carefully tuned module-specific learning rates, while remaining compatible with memory-efficient training algorithms.
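
To make the idea of module-level SNR scaling concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract gives neither the SNR estimator nor the rule that maps SNR to a learning rate. The sketch assumes one common estimator built from the first and second moment averages that Adam already tracks (squared mean gradient over estimated gradient variance). It also assumes a hypothetical square-root mapping normalized to mean 1. The function names, the module names `attn` and `mlp`, the moment values, and even the direction of the mapping (higher SNR gives a larger multiplier) are illustrative assumptions.

```python
import math


def module_snr(m, v, eps=1e-12):
    """Estimate a single SNR for one module (assumed estimator, not from the paper).

    m: running average of gradients for the module's parameters (flat list).
    v: running average of squared gradients (flat list, same length).
    Signal is the squared norm of the mean gradient; noise is the summed
    per-parameter variance estimate v - m^2, clipped at zero.
    """
    signal = sum(mi * mi for mi in m)
    noise = sum(max(vi - mi * mi, 0.0) for mi, vi in zip(m, v))
    return signal / (noise + eps)


def lr_scales(snrs, to_scale=math.sqrt):
    """Map per-module SNRs to learning-rate multipliers with mean 1.

    to_scale is a hypothetical monotone mapping; the source does not
    specify its form or direction.
    """
    raw = {name: to_scale(s) for name, s in snrs.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: r / mean for name, r in raw.items()}


# Toy moment estimates for two hypothetical modules.
moments = {
    "attn": ([0.5, 0.5], [0.5, 0.5]),  # gradient mean is large relative to its variance
    "mlp": ([0.1, 0.1], [0.5, 0.5]),   # gradient mean is small relative to its variance
}
snrs = {name: module_snr(m, v) for name, (m, v) in moments.items()}
scales = lr_scales(snrs)

base_lr = 1e-3
module_lrs = {name: base_lr * s for name, s in scales.items()}
```

In a real training loop the multipliers in `module_lrs` would be applied per module, for example through per-module parameter groups in the optimizer, and recomputed as the moment estimates evolve.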