AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs

πŸ“… 2025-06-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Conventional uniform weight decay in large language model (LLM) training ignores the structural heterogeneity across modules and the differing spectral properties of their weights, leading to suboptimal regularization. Method: This paper proposes a module-level adaptive weight decay scheme, the first to incorporate heavy-tailed self-regularization theory into LLM regularization design. It dynamically assigns a decay strength to each module based on how heavy-tailed the empirical spectral density (ESD) of that module's weight correlation matrix is, enabling structure-aware regularization balancing. The method comprises three components: heavy-tailed spectral analysis, ESD estimation, and modular decay scheduling. Results: Pre-training experiments on models ranging from 60M to 1B parameters demonstrate that the approach significantly reduces perplexity and improves generalization, consistently outperforming both standard uniform weight decay and existing adaptive baselines.
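
A minimal PyTorch sketch of the heavy-tailed spectral analysis step described above: take the ESD as the eigenvalues of the correlation matrix W^T W and fit the tail exponent with a Hill estimator. The function name `esd_alpha`, the `tail_frac` cutoff, and the choice of the Hill estimator are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def esd_alpha(weight: torch.Tensor, tail_frac: float = 0.1) -> float:
    """Estimate the power-law tail exponent (alpha) of a weight matrix's ESD.

    The ESD is taken as the eigenvalues of the correlation matrix W^T W,
    i.e. the squared singular values of W. Under HT-SR theory, a smaller
    alpha indicates a heavier tail and stronger feature learning.
    """
    W = weight.detach().float()
    # torch.linalg.svdvals returns singular values in descending order,
    # so the squared values are the eigenvalues of W^T W, largest first.
    eigs = torch.linalg.svdvals(W) ** 2
    k = max(2, int(tail_frac * eigs.numel()))  # size of the tail sample
    tail = eigs[:k]
    # Hill estimator: alpha = 1 + k / sum_i log(lambda_i / lambda_k).
    log_sum = torch.log(tail / tail[-1]).sum().clamp_min(1e-12)
    return 1.0 + k / log_sum.item()
```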

πŸ“ Abstract
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines.
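
The assignment rule in the abstract (heavier-tailed modules get weaker decay, lighter-tailed modules get stronger decay) can be sketched as a rescaling around a base decay value. The linear mapping and the `spread` hyperparameter below are assumptions made for illustration; the paper's actual schedule may differ.

```python
def assign_decays(alphas: dict[str, float],
                  base_decay: float = 0.1,
                  spread: float = 0.5) -> dict[str, float]:
    """Map per-module ESD tail exponents to weight-decay strengths.

    Modules whose alpha lies above the mean (lighter-tailed spectra)
    receive more than `base_decay`; heavier-tailed modules receive less.
    """
    mean_alpha = sum(alphas.values()) / len(alphas)
    decays = {}
    for name, alpha in alphas.items():
        # Larger alpha => lighter tail => stronger decay, and vice versa.
        scale = 1.0 + spread * (alpha - mean_alpha) / mean_alpha
        decays[name] = base_decay * max(scale, 0.0)  # keep decay non-negative
    return decays
```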
Problem

Research questions and friction points this paper is trying to address.

Adaptive weight decay for LLM module diversity
Balance heavy-tailed spectral properties in modules
Improve perplexity and generalization in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Module-wise adaptive weight decay assignment
Heavy-Tailed Self-Regularization theory guidance
ESD-based decay strength adjustment (see the sketch after this list)
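
A hedged sketch of how the pieces above could be wired together: one AdamW parameter group per weight matrix, each with its own ESD-derived decay. It reuses the hypothetical `esd_alpha` and `assign_decays` helpers from the earlier sketches; treating every 2-D parameter as a module and fixing the decays at startup are simplifications, since the paper describes dynamic per-module scheduling.

```python
import torch
import torch.nn as nn

def build_alpha_decay_optimizer(model: nn.Module, lr: float = 3e-4):
    """AdamW with one parameter group per tensor, each with its own decay."""
    # Estimate the ESD tail exponent for every 2-D weight matrix.
    alphas = {name: esd_alpha(param)
              for name, param in model.named_parameters() if param.ndim == 2}
    decays = assign_decays(alphas)
    # Non-matrix parameters (biases, norms) default to zero decay.
    groups = [{"params": [param], "weight_decay": decays.get(name, 0.0)}
              for name, param in model.named_parameters()]
    return torch.optim.AdamW(groups, lr=lr)
```

Since the spectra evolve during training, a dynamic variant would re-run the estimation periodically and update each group's `weight_decay` in place.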
πŸ‘₯ Authors
Di He
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Peng Cheng Laboratory
Ajay Jaiswal
RS@Apple | Amazon Ph.D. Fellow | UT Austin | IIT-KGP
Model Compression · Pruning · LLMs · Efficient Inference
Songjun Tu
Institute of Automation, Chinese Academy of Sciences; Pengcheng Laboratory
Large Language Models · Reinforcement Learning
Li Shen
Shenzhen Campus of Sun Yat-sen University
Ganzhao Yuan
Shenzhen University of Advanced Technology (SUAT), China
Nonlinear Optimization · Machine Learning
Shiwei Liu
University of Oxford
Lu Yin
University of Surrey