🤖 AI Summary
This work addresses the tendency of deep neural networks to lose generalization on specific subpopulations during training because of drift in the optimization trajectory, even when overall validation accuracy remains high. To mitigate this, the authors propose an online self-distillation framework that dynamically identifies earlier model states exhibiting regional expertise and retains them as lightweight expert anchors, guided by a validation-informed marginal coverage score. These anchors are aggregated via a coverage-weighted ensemble to regularize the loss landscape and stabilize the optimization trajectory. The method introduces, for the first time, a validation-driven trajectory consistency mechanism that significantly enhances model robustness and subpopulation generalization across multiple benchmarks, outperforming both standard training and existing self-distillation approaches, while a lightweight implementation reduces storage overhead by 90%.
📝 Abstract
Deep learning models may converge to suboptimal solutions despite strong validation accuracy, masking an optimization failure we term Trajectory Deviation. As training proceeds, models can abandon high-generalization states for specific data sub-populations, discarding previously learned latent features without triggering classical overfitting signals. To address this problem, we introduce VISTA, an online self-distillation framework that enforces consistency along the optimization trajectory. Using a validation-informed Marginal Coverage score, VISTA identifies expert anchors, which are earlier model states that retain specialized competence over distinct data regions. A coverage-weighted ensemble of these anchors is integrated online during training, regularizing the loss landscape and preserving mastered knowledge. When evaluated across multiple benchmarks, VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods, while a lightweight implementation reduces storage overhead by 90% without performance loss.
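The abstract does not define the Marginal Coverage score or the ensemble weights, so the following is only an illustrative sketch under stated assumptions, not VISTA's actual procedure. It assumes that a checkpoint's marginal coverage is the number of validation examples it classifies correctly that no already-selected anchor covers, that anchors are picked greedily, and that ensemble weights are proportional to each anchor's marginal gain. The names `marginal_coverage`, `select_anchors`, and `ensemble_probs` are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def marginal_coverage(correct: Set[int], covered: Set[int]) -> int:
    """Assumed score: validation examples this checkpoint gets right
    that no previously selected anchor already covers."""
    return len(correct - covered)


def select_anchors(
    correct_sets: Sequence[Set[int]], max_anchors: int
) -> List[Tuple[int, int]]:
    """Greedily pick checkpoints by marginal coverage (an assumption).

    correct_sets[i] holds the validation indices checkpoint i classifies
    correctly. Returns (checkpoint_index, marginal_gain) pairs in the
    order they were selected; ties go to the earliest checkpoint.
    """
    covered: Set[int] = set()
    chosen: List[Tuple[int, int]] = []
    remaining = set(range(len(correct_sets)))
    while remaining and len(chosen) < max_anchors:
        best = max(
            sorted(remaining),
            key=lambda i: marginal_coverage(correct_sets[i], covered),
        )
        gain = marginal_coverage(correct_sets[best], covered)
        if gain == 0:  # no checkpoint adds new coverage; stop early
            break
        chosen.append((best, gain))
        covered |= correct_sets[best]
        remaining.remove(best)
    return chosen


def ensemble_probs(
    anchor_probs: Sequence[Sequence[float]], gains: Sequence[int]
) -> List[float]:
    """Coverage-weighted average of per-anchor class probabilities for
    one input, with weights proportional to marginal gain (assumed)."""
    total = sum(gains)
    weights = [g / total for g in gains]
    n_classes = len(anchor_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, anchor_probs))
        for c in range(n_classes)
    ]
```

In a training loop, the output of `ensemble_probs` could serve as the soft target of a distillation term added to the task loss. How VISTA actually combines the two is not stated in the abstract.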