Convex Dominance in Deep Learning I: A Scaling Law of Loss and Learning Rate

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of analyzing and controlling optimization dynamics in deep learning, where loss landscapes are highly non-convex. Through empirical observation, the authors find that the loss quickly exhibits weak convexity after a short initial phase of training. Leveraging this property together with Lipschitz continuity, they derive an upper bound on the loss that guides learning rate scheduling and scaling. The study systematically uncovers, for the first time, a “convexity-dominated” regime in deep learning and establishes scaling laws for both the loss and the optimal learning rate across varying training durations and model sizes. Remarkably, their approach accurately extrapolates the loss and the optimal learning rate over ranges spanning up to 80-fold changes in training horizon and 70-fold changes in model size.
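The page does not reproduce the bound itself. As a rough illustration of how convexity and Lipschitz continuity turn a learning rate schedule into a loss guarantee, the classical textbook bound for a convex, G-Lipschitz objective f under (sub)gradient descent with step sizes η_t is sketched below; this is the standard weighted-average-iterate bound, not necessarily the paper's last-iterate result, and x* denotes a minimizer.

```latex
% Classical convex-Lipschitz bound (illustration only; the paper's bound for
% the weakly convex, last-iterate setting may take a different form).
f\!\left(\bar{x}_T\right) - f(x^\star)
  \;\le\;
  \frac{\lVert x_1 - x^\star \rVert^2 + G^2 \sum_{t=1}^{T} \eta_t^2}
       {2 \sum_{t=1}^{T} \eta_t},
\qquad
\bar{x}_T \;=\; \frac{\sum_{t=1}^{T} \eta_t\, x_t}{\sum_{t=1}^{T} \eta_t}.
```

Bounds of this shape make the dependence on the schedule {η_t} explicit, which is what allows a schedule to be chosen so as to (approximately) minimize the predicted loss at the end of training.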

📝 Abstract
Deep learning has a non-convex loss landscape, and its optimization dynamics are hard to analyze or control. Nevertheless, the dynamics can be empirically convex-like across various tasks, models, optimizers, hyperparameters, etc. In this work, we examine the applicability of convexity and Lipschitz continuity in deep learning in order to precisely control the loss dynamics via learning rate schedules. We illustrate that deep learning quickly becomes weakly convex after a short period of training, and the loss is predictable by an upper bound on the last iterate, which further informs the scaling of the optimal learning rate. Through the lens of convexity, we build scaling laws of learning rates and losses that extrapolate as much as 80X across training horizons and 70X across model sizes.
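The abstract does not spell out the functional form of these scaling laws. The sketch below assumes, purely for illustration, a power-law dependence of the tuned learning rate on model size N and training horizon T, eta*(N, T) ≈ c · N^(−α) · T^(−β), fits it in log-space, and extrapolates; all data values and the extrapolation target are hypothetical placeholders, not numbers from the paper.

```python
# Hedged sketch: fit a power-law scaling of the optimal learning rate against
# model size N and training horizon T, then extrapolate to a larger run.
# The functional form eta*(N, T) = c * N**(-alpha) * T**(-beta) is an
# assumption for illustration; the paper's actual parameterization may differ.
import numpy as np

# Hypothetical measurements: (model size, training steps, tuned optimal LR).
# These numbers are placeholders, not results reported in the paper.
N = np.array([1e7, 3e7, 1e8, 3e8])                  # model sizes (parameters)
T = np.array([1e4, 3e4, 1e5, 3e5])                  # training horizons (steps)
eta_star = np.array([3e-3, 1.5e-3, 7e-4, 3.5e-4])   # tuned learning rates

# Fit log eta* = log c - alpha * log N - beta * log T by ordinary least squares.
X = np.column_stack([np.ones_like(N), -np.log(N), -np.log(T)])
coef, *_ = np.linalg.lstsq(X, np.log(eta_star), rcond=None)
log_c, alpha, beta = coef

def predict_lr(n_params: float, n_steps: float) -> float:
    """Extrapolate the fitted optimal learning rate to a larger run."""
    return float(np.exp(log_c) * n_params ** (-alpha) * n_steps ** (-beta))

# Extrapolate well beyond the fitted range (the paper reports roughly 70x in
# model size and 80x in training horizon; this target is made up).
print(f"alpha={alpha:.3f}, beta={beta:.3f}")
print("predicted eta* at N=7e9, T=8e5:", predict_lr(7e9, 8e5))
```

The same log-space fit-and-extrapolate recipe applies to the loss itself, with whatever parameterization the paper actually derives from its convex upper bound.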
Problem

Research questions and friction points this paper is trying to address.

non-convex optimization
loss dynamics
learning rate scheduling
convexity
scaling laws
Innovation

Methods, ideas, or system contributions that make the work stand out.

convex dominance
scaling law
learning rate schedule
weak convexity
loss dynamics