Rethinking Neural Network Learning Rates: A Stackelberg Perspective

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of conventional neural network training: a single uniform learning rate may not exploit each layer’s optimization potential, and non-uniform alternatives lack theoretical justification. By reformulating the training objective as a Stackelberg optimization problem, the authors interpret training with a smaller learning rate for the body layers and a larger one for the final layer as a two-time-scale alternating gradient descent algorithm. They identify two mechanisms by which non-uniform learning rates can accelerate convergence: a stronger optimization structure in the Stackelberg objective, and sharper local curvature, especially early in training. The framework yields finite-time convergence guarantees that accommodate constraint sets and non-smooth activation functions, and experiments in supervised and reinforcement learning support the findings.
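One plausible way to write down the reformulation, offered only as a sketch: the symbols below (loss L, body parameters θ, final-layer parameters w, step sizes α and β) are notation assumed here for illustration and are not taken from the paper, whose exact formulation may differ. The body layers play the leader, and the final layer plays the follower that best-responds to them.

```latex
% Assumed notation: \theta = body parameters (leader), w = final-layer parameters (follower).
\min_{\theta} \; L\bigl(\theta, w^{*}(\theta)\bigr)
\quad \text{where} \quad
w^{*}(\theta) \in \arg\min_{w} L(\theta, w)

% Two-time-scale alternating gradient descent with step sizes \alpha < \beta:
w_{k+1} = w_{k} - \beta \, \nabla_{w} L(\theta_{k}, w_{k}), \qquad
\theta_{k+1} = \theta_{k} - \alpha \, \nabla_{\theta} L(\theta_{k}, w_{k+1})
```

Under this reading, the larger step size β lets the final layer track its best response quickly, while the smaller step size α moves the body layers on the slower time scale.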

📝 Abstract

Neural networks are typically trained with a single learning rate across all layers. While recent empirical evidence suggests that assigning layer-specific learning rates can accelerate training, a principled understanding of the conditions and mechanisms under which non-uniform learning rates are beneficial remains limited. In this work, we investigate non-uniform learning rates through the lens of Stackelberg optimization. Specifically, we demonstrate that training neural networks with a smaller learning rate for the body layers and a larger learning rate for the final layer can be interpreted as a two-time-scale alternating gradient descent algorithm applied to a Stackelberg reformulation of the original objective. We establish finite-time convergence guarantees for the algorithm under broad conditions that accommodate constraint sets and non-smooth activation functions. Beyond convergence, we identify two mechanisms by which non-uniform learning rates can outperform uniform learning rates: (i) we show that certain problem instances induce a Stackelberg objective with stronger optimization structure than the original objective, yielding faster convergence to globally optimal solutions; and (ii) our numerical analysis reveals that the Stackelberg objective can exhibit substantially sharper local curvature, especially in early training, which leads to more informative gradients and learning acceleration. Experiments in supervised learning and reinforcement learning support our findings.
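The alternating scheme described in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' algorithm or experimental setup: the scalar two-layer model, the synthetic data, the step sizes, and all names (`predict`, `grads`, `train`, `lr_body`, `lr_head`) are hypothetical choices made here. Each iteration takes a final-layer step with the larger learning rate, then a body step with the smaller one.

```python
import math

def predict(body_w, head_a, x):
    """Toy two-layer model: body feature tanh(body_w * x), then a linear head."""
    return head_a * math.tanh(body_w * x)

def loss(body_w, head_a, data):
    """Mean squared error over the dataset."""
    return sum((predict(body_w, head_a, x) - y) ** 2 for x, y in data) / len(data)

def grads(body_w, head_a, data):
    """Analytic MSE gradients with respect to (body_w, head_a)."""
    g_body = g_head = 0.0
    for x, y in data:
        h = math.tanh(body_w * x)
        err = head_a * h - y
        g_head += 2.0 * err * h
        g_body += 2.0 * err * head_a * (1.0 - h * h) * x
    n = len(data)
    return g_body / n, g_head / n

def train(data, lr_body, lr_head, steps=200, body_w=0.1, head_a=0.1):
    """Alternating gradient descent: head step first, then body step."""
    for _ in range(steps):
        _, g_head = grads(body_w, head_a, data)
        head_a -= lr_head * g_head           # final layer, larger step size
        g_body, _ = grads(body_w, head_a, data)
        body_w -= lr_body * g_body           # body layer, smaller step size
    return body_w, head_a

# Hypothetical synthetic data generated from a known model (illustration only).
data = [(k / 4.0, 2.0 * math.tanh(0.5 * k / 4.0)) for k in range(-8, 9)]

initial = loss(0.1, 0.1, data)
w_u, a_u = train(data, lr_body=0.05, lr_head=0.05)   # uniform learning rates
w_n, a_n = train(data, lr_body=0.05, lr_head=0.5)    # larger final-layer rate
print("initial loss:    ", initial)
print("uniform loss:    ", loss(w_u, a_u, data))
print("non-uniform loss:", loss(w_n, a_n, data))
```

Running the script prints the loss before training and after each configuration, so the two settings can be compared directly. A single toy run of this kind says nothing about the paper's claims in general; it only shows the mechanics of the update order and the per-layer step sizes.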

Problem

Research questions and friction points this paper is trying to address.

learning rates

neural networks

Stackelberg optimization

non-uniform learning rates

convergence

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stackelberg optimization

non-uniform learning rates

two-time-scale gradient descent