One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs

📅 2026-05-21

🤖 AI Summary

This work addresses the inefficiency and limited generalization that arise when Transformer-based large language models are trained with a single uniform learning rate, which ignores the structural heterogeneity across layers. Drawing on heavy-tailed self-regularization (HT-SR) theory, the authors propose Layerwise Learning Rate (LLR), a layer-wise learning rate method that quantifies the heavy-tailedness of each layer from the empirical spectral density (ESD) of its weight correlation matrix. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller rates. LLR adds little tuning overhead: near-optimal learning rate settings transfer directly from the uniform learning rate baseline. Experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and model scales (60M–1B parameters) show up to 1.5× faster training and an improvement in average zero-shot accuracy from 47.09% to 49.02%.

📝 Abstract

Learning rate configuration is a fundamental aspect of modern deep learning. The prevailing practice of applying a uniform learning rate across all layers overlooks the structural heterogeneity of Transformers, potentially limiting their effectiveness as the backbone of Large Language Models (LLMs). In this paper, we introduce Layerwise Learning Rate (LLR), an adaptive scheme that assigns distinct learning rates to individual Transformer layers. Our method is grounded in Heavy-Tailed Self-Regularization (HT-SR) theory, which characterizes the empirical spectral density (ESD) of weight correlation matrices to quantify heavy-tailedness. Layers with weaker heavy-tailedness are assigned larger learning rates to accelerate their training, while layers with stronger heavy-tailedness receive smaller learning rates. By tailoring learning rates in this manner, LLR promotes balanced training across layers, leading to faster convergence and improved generalization. Extensive experiments across architectures (from LLaMA to GPT-nano), optimizers (AdamW and Muon), and parameter scales (60M-1B) demonstrate that LLR achieves up to 1.5x training speedup and outperforms baselines, notably raising average zero-shot accuracy from 47.09% to 49.02%. A key advantage of LLR is its low tuning overhead: it transfers nearly optimal LR settings directly from the uniform baseline. Code is available at https://github.com/hed-ucas/Layer-wise-Learning-Rate.
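The idea in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that). It assumes heavy-tailedness is measured by a power-law exponent alpha fitted to the ESD tail with a Hill estimator, where a smaller alpha means a heavier tail, and it assumes a simple linear map from alpha to a learning rate multiplier. The function names `hill_alpha` and `layerwise_lrs` and the parameters `k_frac`, `s_min`, and `s_max` are hypothetical and do not come from the paper.

```python
import numpy as np


def hill_alpha(weight, k_frac=0.5):
    """Hill estimate of the power-law exponent of the ESD tail of W^T W.

    Smaller alpha means a heavier tail. `k_frac` (fraction of eigenvalues
    treated as the tail) is an assumed knob, not a value from the paper.
    """
    # Eigenvalues of the correlation matrix W^T W are the squared
    # singular values of W, so no explicit matrix product is needed.
    sv = np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False)
    eigs = np.sort(sv ** 2)          # ascending order
    eigs = eigs[eigs > 0]
    n = len(eigs)
    k = min(max(1, int(n * k_frac)), n - 1)
    tail = eigs[n - k:]              # the k largest eigenvalues
    threshold = eigs[n - k - 1]      # the (k+1)-th largest eigenvalue
    return 1.0 + k / np.sum(np.log(tail / threshold))


def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alphas to learning rates around `base_lr`.

    Assumed rule: linear interpolation of a multiplier in [s_min, s_max],
    so the layer with the largest alpha (weakest heavy-tailedness) gets
    the largest learning rate.
    """
    alphas = np.asarray(alphas, dtype=float)
    lo, hi = alphas.min(), alphas.max()
    if hi == lo:
        # All layers look alike: fall back to the uniform learning rate.
        return np.full_like(alphas, base_lr)
    scale = s_min + (alphas - lo) / (hi - lo) * (s_max - s_min)
    return base_lr * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy "layers": one Gaussian matrix and one with heavy-tailed entries.
    layers = {
        "gaussian": rng.standard_normal((256, 128)),
        "heavy_tailed": rng.standard_cauchy((256, 128)),
    }
    alphas = [hill_alpha(w) for w in layers.values()]
    for name, a, lr in zip(layers, alphas, layerwise_lrs(alphas, base_lr=1e-3)):
        print(f"{name}: alpha={a:.2f}, lr={lr:.2e}")
```

In a training loop, the resulting rates would typically be attached to per-layer parameter groups of the optimizer and refreshed periodically, since the spectra change as training proceeds. How often to refresh them is another choice this sketch leaves open.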

Problem

Research questions and friction points this paper is trying to address.

learning rate

layerwise adaptation

Large Language Models

Transformer heterogeneity

training efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Layerwise Learning Rate

Heavy-Tailed Self-Regularization

Empirical Spectral Density