🤖 AI Summary
This work addresses theoretical limitations of schedule-free optimization methods by proposing a unified analytical framework compatible with arbitrary learning-rate schedules. Methodologically, it extends last-iterate convergence theory (previously restricted to constant learning rates) to general schedules such as warmup-stable-decay, shows how the parameter-averaging weight must be updated as a function of the time-varying learning rate, and designs an adaptive Polyak step-size rule achieving the optimal anytime convergence rate $\mathcal{O}(1/\sqrt{T})$. The theoretical analysis is established under convexity assumptions. Empirically, the method performs well against SGD, Adam, and existing schedule-free baselines on a black-box model distillation task, and the convexity-based theory shows some predictive power for practical training of deep neural networks.
📝 Abstract
The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. However, the current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice employs a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical training of deep neural networks, even though the theory relies on assuming convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory yields the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence rate for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
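To make the setup concrete, below is a minimal sketch of schedule-free SGD with an arbitrary learning-rate schedule. It follows the standard schedule-free structure (base iterate `z`, running average `x`, gradient evaluation point `y`); the averaging weight `c` here is taken proportional to the current learning rate, which is an illustrative assumption on our part, not the paper's exact update rule.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lrs, beta=0.9):
    """Sketch of schedule-free SGD under a given learning-rate schedule `lrs`.

    Assumption: the averaging weight c_t is proportional to the current
    learning rate; the paper derives how this weight must actually depend
    on the schedule.
    """
    z = x0.copy()       # base SGD iterate
    x = x0.copy()       # weighted running average of the z iterates
    lr_sum = 0.0
    for lr in lrs:
        y = (1 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)            # SGD step on the base iterate
        lr_sum += lr
        c = lr / lr_sum                 # schedule-dependent averaging weight
        x = (1 - c) * x + c * z         # update the running average
    return x

# Usage: minimize f(w) = ||w||^2 / 2 with a warmup-stable-decay (wsd) schedule.
grad = lambda w: w
lrs = np.concatenate([
    np.linspace(0.01, 0.5, 20),   # warm-up
    np.full(60, 0.5),             # stable
    np.linspace(0.5, 0.0, 20),    # decay
])
w = schedule_free_sgd(grad, np.array([5.0, -3.0]), lrs)
```

On this toy convex quadratic the averaged iterate contracts toward the minimizer at the origin; the point of the sketch is only to show where the schedule enters both the step and the averaging weight.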