🤖 AI Summary
This work studies the optimal design of spectral shrinkage estimators under spiked covariance models. Within a statistical framework for self-distillation, it shows that when the covariance has $s$ spikes, $s$ steps of self-distillation achieve optimal performance among spectral shrinkage estimators, whereas any smaller number of steps is strictly suboptimal. For isotropic covariances, optimally tuned ridge regression is the best estimator in this class. The analysis also covers a federated setting in which data centers share spectral shrinkage estimators with a server that aggregates them: the best local rule is again a form of self-distillation, although it differs from the optimal rule for centrally hosted data. Together, these results explain why self-distillation improves predictive performance and connect it to classical shrinkage-based methods.
📝 Abstract
Self-distillation has emerged as a promising technique for improving model performance in modern machine learning systems. We develop the statistical foundations of self-distillation in spiked covariance models, by introducing and analyzing a broad class of estimators, namely spectral shrinkage estimators. We establish that for spiked covariance matrices with $s$ spikes, $s$-step self-distillation achieves optimal performance among spectral shrinkage estimators, outperforming well-known estimators in statistics and machine learning. Moreover, we show that $s$ steps are necessary for optimality: any $(s-k)$-step distilled estimator is strictly suboptimal for $1 \leq k \leq s$. For the special subclass of isotropic covariances, we show that optimally tuned Ridge regression performs best among spectral shrinkage estimators. We also study a federated approach where multiple data centers share spectral shrinkage estimators and a common server seeks to aggregate them to achieve optimal performance. In this case, we find that the best local rule again takes the form of self-distillation, though it differs from the optimal rule when data are hosted centrally on a single server. Together, our results elucidate why self-distillation improves predictive performance and provide a broader statistical framework connecting it with classical shrinkage-based methods.
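The abstract does not state the estimator's formula, so the following is only a minimal sketch under an assumed, commonly used formulation of self-distillation for ridge regression: each step refits ridge on a blend of the original labels and the previous fit's predictions. The function names and the mixing weights `xis` are hypothetical and not taken from the paper. The sketch illustrates why such estimators can be viewed as spectral shrinkage: the iterated fit coincides with a per-singular-direction rescaling of $U^\top y$, where $X = U D V^\top$.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def self_distill(X, y, lam, xis):
    """Assumed self-distillation scheme (not necessarily the paper's):
    each step refits ridge on xi * y + (1 - xi) * (previous predictions).
    `xis` holds one hypothetical mixing weight per distillation step."""
    beta = ridge(X, y, lam)
    for xi in xis:
        target = xi * y + (1.0 - xi) * (X @ beta)
        beta = ridge(X, target, lam)
    return beta

def spectral_form(X, y, lam, xis):
    """Same estimator written as shrinkage along singular directions of X."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    c = d / (d**2 + lam)        # ridge shrinkage applied to U^T y
    r = d**2 / (d**2 + lam)     # ridge shrinkage of fitted values
    m = np.ones_like(d)         # extra multiplier accumulated by distillation
    for xi in xis:
        m = xi + (1.0 - xi) * r * m
    return Vt.T @ (c * m * (U.T @ y))

# Small synthetic check that the two forms agree.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(50)
xis = [0.7, 1.3, 0.4]  # arbitrary illustrative weights, three steps
print(np.allclose(self_distill(X, y, 1.0, xis), spectral_form(X, y, 1.0, xis)))
```

Under this formulation, each additional step multiplies the per-direction shrinkage factor by another function of the singular value, so more steps give a more flexible shrinkage profile. How the weights should be tuned for $s$ spikes is the subject of the paper and is not reproduced here.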