🤖 AI Summary
This work addresses black-box knowledge distillation without hyperparameter tuning, aiming to efficiently compress a large teacher GPT-2 model into a compact student model. Methodologically, we directly apply the idealized stochastic Polyak step size (SPS$^*$) and its momentum-enhanced variant to train the student model, eliminating all manual hyperparameter optimization. Theoretically, we establish, for the first time, that SPS$^*$ achieves an $O(1/\sqrt{t})$ convergence rate at *any* iteration under convexity and a local expected gradient bound, which covers locally smooth and locally Lipschitz losses, without requiring global smoothness or global Lipschitz continuity; we further prove that the momentum variant ensures the *final iterate* attains the same optimal rate. Empirically, our approach consistently improves student model performance across diverse distillation settings and significantly outperforms carefully tuned baseline methods in both stability and accuracy.
📝 Abstract
We provide a general convergence theorem for an idealized stochastic Polyak step size called SPS$^*$. Besides convexity, we assume only a local expected gradient bound, which includes locally smooth and locally Lipschitz losses as special cases. We refer to SPS$^*$ as idealized because it requires access to the loss of every training batch evaluated at a solution. It is also ideal in that it achieves the optimal lower bound for globally Lipschitz functions, and it is the first Polyak step size to achieve $O(1/\sqrt{t})$ anytime convergence in the smooth setting. We show how to combine SPS$^*$ with momentum to achieve the same favorable rates for the last iterate. We conclude with several experiments that validate our theory, and a more practical setting showing how we can distill a teacher GPT-2 model into a smaller student model without any hyperparameter tuning.
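To make the method concrete, here is a minimal sketch of one SPS$^*$ update as described above: the step size is the batch loss gap to the solution divided by the squared batch-gradient norm, $\gamma_t = (f_{B_t}(w_t) - f_{B_t}(w^*)) / \|\nabla f_{B_t}(w_t)\|^2$. The toy problem (interpolating least squares, where every batch loss is zero at the solution, so $f_{B_t}(w^*)=0$) and all names such as `sps_star_step` are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def sps_star_step(w, grad, batch_loss, batch_loss_at_opt):
    """One idealized Polyak step:
    w_{t+1} = w_t - gamma_t * grad, with
    gamma_t = (f_B(w_t) - f_B(w*)) / ||grad f_B(w_t)||^2."""
    g2 = float(np.dot(grad, grad))
    if g2 == 0.0:  # zero gradient on this batch: no update
        return w
    gamma = (batch_loss - batch_loss_at_opt) / g2
    return w - gamma * grad

# Toy interpolating least-squares problem: b = A @ w_star exactly,
# so every batch loss vanishes at w_star and f_B(w*) = 0 is known.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
w_star = rng.standard_normal(5)
b = A @ w_star

w = np.zeros(5)
for t in range(500):
    idx = rng.integers(0, 100, size=10)      # sample a mini-batch
    r = A[idx] @ w - b[idx]
    batch_loss = 0.5 * np.mean(r ** 2)       # f_B(w_t)
    grad = A[idx].T @ r / len(idx)           # grad f_B(w_t)
    w = sps_star_step(w, grad, batch_loss, batch_loss_at_opt=0.0)

print(np.linalg.norm(w - w_star))  # distance to the solution shrinks toward zero
```

Note that knowing `batch_loss_at_opt` for every batch is exactly the "idealized" requirement named in the abstract; under interpolation it is simply zero, which is what makes this sketch self-contained.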