🤖 AI Summary
Traditional k-fold cross-validation often yields strong validation performance but poor generalization for inherently unstable yet interpretable models, such as sparse regression and CART, because these models are highly sensitive to perturbations of the training data.
Method: We propose a nested k-fold cross-validation framework that incorporates an empirical stability regularizer, based on prediction perturbation, into hyperparameter selection. The weight on the stability term is itself learned in an inner cross-validation loop, integrating stability constraints directly into the hyperparameter optimization pipeline.
Contribution/Results: This is the first work to systematically embed stability regularization into hyperparameter tuning while preserving predictive accuracy. Experiments across 13 UCI datasets show that the method reduces out-of-sample mean squared error by 4% on average for sparse ridge regression and CART, with no significant performance change for stable models such as XGBoost, demonstrating both efficacy and specificity in improving generalization for unstable, interpretable models.
📝 Abstract
We revisit the problem of ensuring strong test-set performance via cross-validation. Motivated by the generalization theory literature, we propose a nested k-fold cross-validation scheme that selects hyperparameters by minimizing a weighted sum of the usual cross-validation metric and an empirical model-stability measure. The weight on the stability term is itself chosen via a nested cross-validation procedure. This reduces the risk of strong validation-set performance masking poor test-set performance caused by instability. We benchmark our procedure on a suite of 13 real-world UCI datasets and find that, compared to k-fold cross-validation over the same hyperparameters, it reduces the out-of-sample MSE for sparse ridge regression and CART by 4% on average, but has no impact on XGBoost. This suggests that for interpretable but unstable models, such as sparse regression and CART, our approach is a viable and computationally affordable method for improving test-set performance.
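To make the procedure concrete, here is a minimal sketch of stability-regularized hyperparameter selection. This is not the authors' code: the instability measure used here (variance of the k fold-specific models' predictions, one plausible reading of "prediction perturbation"), the hyperparameter grid, and the outer hold-out split standing in for the full nested k-fold loop are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation) of hyperparameter
# selection that minimizes CV-MSE + lam * instability, with the stability
# weight lam chosen on a held-out split as a simplified stand-in for the
# paper's nested k-fold procedure.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor

def cv_score_and_instability(make_model, X, y, k=5, seed=0):
    """Return (mean CV MSE, instability) for one hyperparameter setting.

    Instability (an assumed measure) is the average variance, across the
    k fold-specific models, of their predictions on the full dataset.
    """
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    mses, preds = [], []
    for train_idx, val_idx in kf.split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        mses.append(np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2))
        preds.append(model.predict(X))  # this fold's model, all points
    instability = np.mean(np.var(np.stack(preds), axis=0))
    return np.mean(mses), instability

def select_hyperparameter(grid, X, y, lam):
    """Pick the grid value minimizing CV-MSE + lam * instability."""
    scores = {}
    for depth in grid:
        mse, inst = cv_score_and_instability(
            lambda d=depth: DecisionTreeRegressor(max_depth=d, random_state=0),
            X, y)
        scores[depth] = mse + lam * inst
    return min(scores, key=scores.get)

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Outer loop: choose the stability weight lam by the validation error of the
# model it leads to (the paper nests full k-fold CV here instead).
best_lam, best_err = None, np.inf
for lam in [0.0, 0.1, 1.0]:
    depth = select_hyperparameter([2, 4, 8, None], X_tr, y_tr, lam)
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    err = np.mean((model.predict(X_val) - y_val) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
print("selected stability weight:", best_lam)
```

Setting lam = 0 recovers plain k-fold cross-validation, so the nested search can only match or improve the validation criterion; the paper's finding is that for unstable models the selected lam is typically nonzero.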