🤖 AI Summary
Conventional Transformer training employs a uniform learning rate across all model components, ignoring their heterogeneous dynamics in parameter sensitivity and update magnitude, which leads to suboptimal optimization efficiency.
Method: We propose a dynamic, decoupled learning rate scheduling framework that introduces, for the first time, the concept of *relative learning rates*: per-component learning rates scaled adaptively according to layer-specific gradient statistics and architectural roles. The approach is architecture-agnostic within the Transformer family and integrates seamlessly with Mixture of Experts (MoE) configurations.
Contribution/Results: The method enables direct hyperparameter transfer across model scales, from small baselines to models 27× larger, without manual retuning. Empirical evaluation demonstrates up to 23% faster convergence on complex models and substantial reductions in computational resource consumption. This work establishes a scalable, efficient, and broadly generalizable optimization paradigm for large-scale neural networks.
📝 Abstract
In this work, we introduce a novel approach for optimizing LLM training by adjusting learning rates across the weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our relative learning rate schedules (RLRS) method accelerates the training process by up to $23\%$, particularly in complex models such as Mixture of Experts (MoE). The hyperparameters of RLRS can be tuned efficiently on smaller models and then reused effectively on models up to $27\times$ larger. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.
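To make the core idea concrete, here is a minimal sketch (not the paper's implementation) of how relative learning rates could be expressed: each Transformer component gets a multiplier on a single base learning rate, so only the multipliers need tuning on a small model, and they can be reused unchanged when scaling up. The component names and multiplier values below are purely hypothetical assumptions for illustration.

```python
# Hypothetical relative multipliers per Transformer component.
# In the RLRS setting, values like these would be tuned once on a
# small model and then transferred to much larger models as-is.
RELATIVE_LR = {
    "embedding": 1.0,
    "attention": 0.5,
    "ffn": 0.8,
    "moe_router": 0.1,   # assumed to be more sensitive (illustrative)
    "output_head": 0.3,
}

def effective_lrs(base_lr, relative=RELATIVE_LR):
    """Map each component to base_lr scaled by its relative rate."""
    return {name: base_lr * r for name, r in relative.items()}

# Only the single base learning rate changes with model scale;
# the relative multipliers (the tuned hyperparameters) are reused.
small = effective_lrs(base_lr=3e-4)
large = effective_lrs(base_lr=1e-4)
```

In a real training setup these per-component rates would typically be wired into the optimizer's parameter groups (e.g. PyTorch's `param_groups`), but the key property shown here is that the ratios between component learning rates are preserved across scales.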