🤖 AI Summary
This work addresses theoretical limitations of schedule-free optimization methods by proposing a unified analytical framework compatible with arbitrary learning-rate schedules. Methodologically, it extends last-iterate convergence theory (previously restricted to constant learning rates) to general schedules such as warmup-stable-decay, shows how the parameter-averaging weight must be updated as a function of the time-varying learning rate, and designs an adaptive Polyak step-size rule achieving the optimal anytime convergence rate $\mathcal{O}(1/\sqrt{T})$. The theoretical analysis is established under convexity assumptions. Empirically, the method performs well against SGD, Adam, and existing schedule-free baselines on a black-box model distillation task, and the convexity-based theory shows some predictive power for practical training of deep neural networks.
📝 Abstract
The recently proposed schedule-free method has been shown to achieve strong performance when hyperparameter tuning is limited. However, the current theory for schedule-free only supports a constant learning rate, whereas the implementation used in practice employs a warm-up schedule. We show how to extend the last-iterate convergence theory of schedule-free to allow for any scheduler, and how the averaging parameter has to be updated as a function of the learning rate. We then perform experiments showing that our convergence theory has some predictive power with regard to practical training of deep neural networks, even though the theory relies on assuming convexity. When applied to the warmup-stable-decay (wsd) schedule, our theory yields the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$. We then use convexity to design a new adaptive Polyak learning rate schedule for schedule-free. We prove an optimal anytime last-iterate convergence rate for our new Polyak schedule, and show that it performs well compared to a number of baselines on a black-box model distillation task.
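To make the setup concrete, below is a minimal sketch of schedule-free SGD with an arbitrary learning-rate schedule. It follows the standard schedule-free structure (base iterate `z`, running average `x`, gradient evaluation point `y`); the averaging weight `c` here is taken proportional to the current learning rate, which is an illustrative assumption on our part, not the paper's exact update rule.

```python
import numpy as np

def schedule_free_sgd(grad, x0, lrs, beta=0.9):
    """Sketch of schedule-free SGD under a given learning-rate schedule `lrs`.

    Assumption: the averaging weight c_t is proportional to the current
    learning rate; the paper derives how this weight must actually depend
    on the schedule.
    """
    z = x0.copy()       # base SGD iterate
    x = x0.copy()       # weighted running average of the z iterates
    lr_sum = 0.0
    for lr in lrs:
        y = (1 - beta) * z + beta * x   # gradient evaluation point
        z = z - lr * grad(y)            # SGD step on the base iterate
        lr_sum += lr
        c = lr / lr_sum                 # schedule-dependent averaging weight
        x = (1 - c) * x + c * z         # update the running average
    return x

# Usage: minimize f(w) = ||w||^2 / 2 with a warmup-stable-decay (wsd) schedule.
grad = lambda w: w
lrs = np.concatenate([
    np.linspace(0.01, 0.5, 20),   # warm-up
    np.full(60, 0.5),             # stable
    np.linspace(0.5, 0.0, 20),    # decay
])
w = schedule_free_sgd(grad, np.array([5.0, -3.0]), lrs)
```

On this toy convex quadratic the averaged iterate contracts toward the minimizer at the origin; the point of the sketch is only to show where the schedule enters both the step and the averaging weight.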