🤖 AI Summary
This work addresses black-box knowledge distillation without hyperparameter tuning, aiming to efficiently compress a large teacher GPT-2 model into a compact student model. Methodologically, we directly apply the idealized stochastic Polyak step size (SPS$^*$) and its momentum-enhanced variant to train the student model, eliminating all manual hyperparameter optimization. Theoretically, we establish, for the first time, that SPS$^*$ achieves an $O(1/\sqrt{t})$ convergence rate at *any* iteration under convexity and a local expected gradient bound, which covers locally smooth and locally Lipschitz losses, without requiring global smoothness or global Lipschitz continuity; we further prove that the momentum variant ensures the *final iterate* attains the same optimal rate. Empirically, our approach consistently improves student model performance across diverse distillation settings and significantly outperforms carefully tuned baseline methods in both stability and accuracy.
📝 Abstract
We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we assume only a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss of every training batch evaluated at a solution. It is also ideal in that it achieves the optimal lower bound for globally Lipschitz functions, and it is the first Polyak step size to achieve $O(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments that validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
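To make the method concrete, here is a minimal sketch of one SPS$^*$ update as described above: the step size is the batch loss gap to the solution divided by the squared batch-gradient norm, $\gamma_t = (f_{B_t}(w_t) - f_{B_t}(w^*)) / \|\nabla f_{B_t}(w_t)\|^2$. The toy problem (interpolating least squares, where every batch loss is zero at the solution, so $f_{B_t}(w^*)=0$) and all names such as `sps_star_step` are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def sps_star_step(w, grad, batch_loss, batch_loss_at_opt):
    """One idealized Polyak step:
    w_{t+1} = w_t - gamma_t * grad, with
    gamma_t = (f_B(w_t) - f_B(w*)) / ||grad f_B(w_t)||^2."""
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:  # zero gradient on this batch: no update
        return w
    gamma = (batch_loss - batch_loss_at_opt) / g2
    return w - gamma * grad

# Toy interpolating least-squares problem: b = A @ w_star exactly,
# so every batch loss vanishes at w_star and f_B(w*) = 0 is known.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
w_star = rng.standard_normal(5)
b = A @ w_star

w = np.zeros(5)
for t in range(500):
    idx = rng.integers(0, 100, size=10)      # sample a mini-batch
    r = A[idx] @ w - b[idx]
    batch_loss = 0.5 * np.mean(r ** 2)       # f_B(w_t)
    grad = A[idx].T @ r / len(idx)           # grad f_B(w_t)
    w = sps_star_step(w, grad, batch_loss, batch_loss_at_opt=0.0)

print(np.linalg.norm(w - w_star))  # distance to the solution shrinks toward zero
```

Note that knowing `batch_loss_at_opt` for every batch is exactly the "idealized" requirement named in the abstract; under interpolation it is simply zero, which is what makes this sketch self-contained.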