🤖 AI Summary
This work addresses the poorly understood cooldown phase in Warmup-Stable-Decay (WSD) learning rate schedulers and systematically investigates its impact on Transformer training dynamics and generalization. Through loss landscape visualization, sensitivity analysis of AdamW hyperparameters, and controlled experiments on learning rate shape, we identify a critical bias–variance trade-off during cooldown: excessively rapid decay amplifies bias, while overly slow decay increases variance. To balance exploration and exploitation, we propose an adaptive cooldown strategy and demonstrate that high β₂ values (e.g., 0.9999) significantly improve AdamW's gradient stability and convergence quality during this phase. Empirical evaluation shows that our approach, combined with principled cooldown shape selection, yields substantial gains in model accuracy and robustness, matching the performance improvement achieved via fine-grained hyperparameter tuning. The method provides an interpretable, practical, and easily deployable configuration paradigm for WSD schedulers.
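Why a higher β₂ stabilizes AdamW's second-moment estimate can be illustrated with a small sketch (pure Python; the gradient sequence and function names are illustrative assumptions, not the paper's experimental setup):

```python
def adam_v_hat(grads, beta2):
    """Bias-corrected second-moment estimate of squared gradients, as in Adam/AdamW."""
    v = 0.0
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1 - beta2) * g * g
    return v / (1 - beta2 ** t)

# 5000 steady gradients followed by a single noisy outlier.
steady = [1.0] * 5000
noisy = steady + [10.0]

def outlier_jump(beta2):
    """How much one outlier gradient perturbs the second-moment estimate."""
    return adam_v_hat(noisy, beta2) - adam_v_hat(steady, beta2)
```

With β₂ = 0.9999 the estimate averages over a much longer gradient history, so a single noisy gradient perturbs it far less than with the default 0.999, consistent with the summary's claim of improved gradient stability during cooldown.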
📝 Abstract
Learning rate scheduling is essential in transformer training, where the final annealing plays a crucial role in achieving the best performance. However, the mechanisms behind this cooldown phase, with its characteristic drop in loss, remain poorly understood. To address this, we provide a comprehensive analysis focusing solely on the cooldown phase of the Warmup-Stable-Decay (WSD) learning rate scheduler. Our analysis shows that different cooldown shapes expose a fundamental bias–variance trade-off in the resulting models, with shapes that balance exploration and exploitation consistently outperforming alternatives. Similarly, we find substantial performance variations – comparable to those from cooldown shape selection – when tuning AdamW hyperparameters. Notably, we observe consistent improvements with higher values of $\beta_2$ during cooldown. From a loss landscape perspective, we provide visualizations of the landscape during cooldown, empirically supporting the river valley loss perspective. These findings offer practical recommendations for configuring the WSD scheduler in transformer training, emphasizing the importance of optimizing the cooldown phase alongside traditional hyperparameter tuning.