Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the dual effect of iterative self-training on generalization in high-dimensional overparameterized linear regression: it simultaneously suppresses noise through data-dependent projection while introducing systematic bias due to signal forgetting. By leveraging a spiked covariance model and deterministic equivalent recursions, the study provides the first explicit disentanglement of the denoising and signal-forgetting mechanisms, revealing that their competition underlies the characteristic U-shaped test risk curve. The theoretical analysis, combining spectral methods and concentration inequalities, establishes that the empirical risk along the iteration path concentrates around its deterministic limit. Furthermore, the authors propose an iterated generalized cross-validation (GCV) criterion with uniform convergence guarantees, enabling automatic selection of both the optimal stopping time and regularization strength. Experiments corroborate the denoising–forgetting trade-off and highlight self-training’s implicit soft feature selection capability, distinct from the spectral filtering mechanism of ridge regression.
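The iterative procedure behind these results can be sketched in a few lines of NumPy. Everything below — the dimensions, the spiked covariance, the noise level, and the ridge solver — is an illustrative choice, not the paper's actual experimental setup; the point is only to make the U-shaped test-risk curve concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 300            # overparameterized: more features than samples
n_spk, spk = 5, 100.0      # spiked covariance: a few strong eigendirections
sigma, lam, T = 6.0, 1.0, 15

evals = np.ones(d)
evals[:n_spk] = spk                       # population covariance diag(evals)
beta_star = np.zeros(d)
beta_star[:n_spk] = 1.0                   # true signal sits on the spikes

def sample(n):
    # rows are independent draws with covariance diag(evals)
    return rng.standard_normal((n, d)) * np.sqrt(evals)

def ridge(X, y, lam):
    # ridge fit via the n x n Gram system (cheap when d > n)
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(X.shape[0]), y)

# initial estimator: trained once on noisy labels
X0 = sample(n)
beta = ridge(X0, X0 @ beta_star + sigma * rng.standard_normal(n), lam)

risks = []
for t in range(T):
    # population prediction risk E[(x^T (beta - beta*))^2] under diag(evals)
    risks.append(float(np.sum(evals * (beta - beta_star) ** 2)))
    Xt = sample(n)                        # fresh covariates each round
    beta = ridge(Xt, Xt @ beta, lam)      # refit on noiseless pseudo-labels

best = int(np.argmin(risks))              # empirical early-stopping time
```

In this toy run the risk first drops as the repeated data-dependent projections suppress label noise, then rises again as the signal is progressively shrunk away — the denoising-vs-forgetting competition the summary describes.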

📝 Abstract
Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising-forgetting trade-off.
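The paper's iterated GCV criterion is not spelled out in this abstract, but it builds on the classical generalized cross-validation score for a single ridge fit, which can be sketched as follows. The data-generating choices and the candidate grid of regularization strengths below are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 80, 200, 1.0

X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d) / np.sqrt(d)
y = X @ beta_star + sigma * rng.standard_normal(n)

def gcv(X, y, lam):
    # classical GCV for ridge: mean squared training residual, inflated by
    # the effective degrees of freedom of the smoother S with y_hat = S y
    n = X.shape[0]
    G = X @ X.T + lam * np.eye(n)        # n x n Gram system (cheap when d > n)
    S = X @ X.T @ np.linalg.inv(G)       # hat matrix: y_hat = S y
    resid = y - S @ y
    df = np.trace(S)                     # effective degrees of freedom, < n
    return float(resid @ resid / n) / (1.0 - df / n) ** 2

lams = [0.1, 1.0, 10.0, 100.0]
scores = [gcv(X, y, lam) for lam in lams]
lam_gcv = lams[int(np.argmin(scores))]   # data-driven regularization choice
```

The abstract's criterion extends this idea along the self-training trajectory, scoring each iterate so that both the stopping time and the regularization strength can be chosen from data alone.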
Problem

Research questions and friction points this paper is trying to address.

self-training
overparameterized regression
signal forgetting
denoising
prediction risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

self-training
denoising
signal forgetting
spectral filtering
generalized cross-validation