Universal Dynamics of Warmup Stable Decay: understanding WSD beyond Transformers

📅 2026-01-13
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether the superior performance of the Warmup Stable Decay (WSD) learning rate scheduler is specific to the Transformer architecture or reflects universal geometric properties of high-dimensional non-convex optimization landscapes. By systematically analyzing loss surfaces, sharpness dynamics, and optimization trajectories under Adam with WSD on a Pythia-style language model and a small CNN trained on CIFAR-10, the work reveals a striking alignment between the two architectures in training signals, optimization paths, and sharpness evolution. These findings indicate that WSD's effectiveness arises from shared geometric structure in the loss landscape rather than model-specific design choices, offering cross-architectural evidence that advances the understanding of deep learning optimization.
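To make the schedule concrete, the sketch below contrasts a WSD schedule with a standard cosine-decay baseline. This is a minimal illustration rather than the paper's implementation; the peak learning rate and the phase fractions (10% warmup, decay over only the final 10% of steps) are assumed values chosen for readability.

```python
import math

def wsd_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.1, decay_frac=0.1):
    """Warmup Stable Decay: linear warmup, long constant plateau,
    then decay over only the final fraction of training (assumed fractions)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    decay_steps = max(1, int(decay_frac * total_steps))
    stable_end = total_steps - decay_steps
    if step < warmup_steps:                  # warmup phase
        return peak_lr * (step + 1) / warmup_steps
    if step < stable_end:                    # stable phase
        return peak_lr
    progress = (step - stable_end) / decay_steps
    return peak_lr * (1.0 - progress)        # decay phase (linear, to ~0)

def cosine_lr(step, total_steps, peak_lr=3e-4, warmup_frac=0.1):
    """Cosine baseline: the learning rate decays over the entire post-warmup run."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The contrast the paper exploits is visible here: under cosine decay the learning rate shrinks for most of training, while under WSD it stays at its peak until a short final decay phase.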

📝 Abstract
The Warmup Stable Decay (WSD) learning rate scheduler has recently become popular, largely due to its good performance and flexibility when training large language models. It remains an open question whether the remarkable performance of WSD, which applies a decaying learning rate for only a fraction of training compared to cosine decay, is a phenomenon specific to transformer-based language models, one that could potentially offer new theoretical insights into their training dynamics. Inspired by the use of learning rate schedulers as a new lens for understanding landscape geometry (e.g., river valleys, connected minima, progressive sharpening), in this work we compare the WSD path of the Adam optimizer on a Pythia-like language model to that of a small CNN trained to classify CIFAR-10 images. We observe that most training signals, optimizer path features, and sharpness dynamics are qualitatively similar across these architectures. This consistency points to shared geometric characteristics of the loss landscapes of both old and new nonconvex problems, and hints at future research questions around the geometry of high-dimensional optimization problems.
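Sharpness dynamics of the kind the abstract mentions are commonly tracked via the largest eigenvalue of the loss Hessian, estimated with Hessian-vector products and power iteration. The sketch below is one such estimator under an assumed PyTorch setup with illustrative names; it is not the paper's measurement code.

```python
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    """Estimate sharpness as the top Hessian eigenvalue of `loss` w.r.t.
    `params` using power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]   # random starting direction
    eigval = 0.0
    for _ in range(iters):
        # normalize the current direction across all parameter tensors
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. params
        gv = sum((g * vi).sum() for g, vi in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v with the normalized direction
        eigval = sum((hvi * vi).sum() for hvi, vi in zip(hv, v)).item()
        v = [hvi.detach() for hvi in hv]
    return eigval
```

Logging this quantity across the warmup, stable, and decay phases is one way to produce the kind of sharpness-evolution comparison the abstract describes.
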
Problem

Research questions and friction points this paper is trying to address.

Warmup Stable Decay
loss landscape
learning rate scheduler
nonconvex optimization
neural network training dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Warmup Stable Decay
loss landscape geometry
learning rate scheduler
optimization dynamics
nonconvex optimization