🤖 AI Summary
This work addresses the lack of theoretical guarantees for self-distillation and the unclear mechanism behind its generalization improvement, in an unconstrained setting where the mixing weight may fall outside the [0, 1] interval. Focusing on ridge regression, the paper provides the first rigorous proof that the optimally mixed self-distilled student can strictly outperform the ridge teacher, together with a closed-form expression for the optimal mixing weight and a sign rule linking its sign to the slope of the teacher's risk curve. By extending standard second-order ridge deterministic equivalents to fourth-order analogs via block linearization, under proportional asymptotics with general anisotropic covariance, the authors derive exact asymptotic expressions for the optimal self-distillation risk. A proposed consistent one-shot tuning method estimates the optimal weight without grid search, and experiments demonstrate reduced prediction risk on real-world datasets and on features extracted from pretrained neural networks.
📝 Abstract
Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $ξ$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $λ> 0$ at which the teacher ridge risk $R(λ)$ is nonstationary (i.e., $R'(λ) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $ξ^\star(λ)$ for any value of $λ$ and show that it obeys the sign rule: $\operatorname{sign}(ξ^\star(λ))=-\operatorname{sign}(R'(λ))$. In particular, $ξ^\star(λ)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $ξ^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
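To make the setup concrete, here is a minimal NumPy sketch (not the paper's code) of ridge self-distillation with an unconstrained mixing weight $ξ$. The problem sizes, seed, identity covariance, and the use of a known signal `beta` are all assumptions made for this toy example; the oracle optimal weight below requires knowing `beta`, unlike the paper's one-shot tuning method, and serves only to illustrate the strict improvement and the sign rule numerically.

```python
import numpy as np

# Toy setup (assumed values): Sigma = I, so the excess risk of an
# estimate b_hat is simply ||b_hat - beta||^2.
rng = np.random.default_rng(0)
n, p, lam = 200, 50, 5.0
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + 0.5 * rng.standard_normal(n)

def ridge(X, y, lam):
    # Standard ridge solution (X'X + lam I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_teacher = ridge(X, y, lam)               # teacher: plain ridge fit
b_distill = ridge(X, X @ b_teacher, lam)   # student trained purely on teacher labels (xi = 1)

# Ridge is linear in the targets, so the student trained on the mixture
# (1 - xi) * y + xi * (X @ b_teacher) equals b_teacher + xi * d, with:
d = b_distill - b_teacher

# Oracle optimal xi minimizing the quadratic ||b_teacher + xi * d - beta||^2:
xi_star = -d @ (b_teacher - beta) / (d @ d)

risk_teacher = np.sum((b_teacher - beta) ** 2)
risk_student = np.sum((b_teacher + xi_star * d - beta) ** 2)
assert risk_student < risk_teacher         # optimally mixed student strictly improves

# In this setup, R'(lam) = -2 <A^{-1} b_teacher, b_teacher - beta> with A = X'X + lam I,
# so the sign rule sign(xi*) = -sign(R'(lam)) can be checked directly:
A = X.T @ X + lam * np.eye(p)
Rprime = -2 * np.linalg.solve(A, b_teacher) @ (b_teacher - beta)
assert np.sign(xi_star) == -np.sign(Rprime)
```

Note that `xi_star` is not constrained to [0, 1]: when the teacher is over-regularized ($R'(λ) > 0$), the oracle weight comes out negative, i.e., the student extrapolates away from the teacher's predictions rather than interpolating toward them.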