The Relative Instability of Model Comparison with Cross-validation

📅 2025-08-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a fundamental theoretical limitation of cross-validation (CV) for comparing the test errors of two machine learning algorithms (e.g., soft-thresholded least squares and the Lasso), even when each algorithm individually satisfies classical stability conditions. The core issue is that the CV estimator of the error difference may lack *relative stability*, rendering the resulting confidence intervals invalid. The authors formally define and analyze relative stability, proving that standard stability frameworks do not guarantee validity of CV-based confidence intervals for error differences. Through theoretical analysis in a sparse low-dimensional linear model, corroborated by empirical experiments, they pinpoint the mechanism behind CV's systematic failure in estimating error differences, and confirm that the resulting confidence intervals are frequently severely miscalibrated. The work provides a theoretical warning and principled guidance for uncertainty quantification in ML model comparison.

📝 Abstract
Existing work has shown that cross-validation (CV) can be used to provide an asymptotic confidence interval for the test error of a stable machine learning algorithm, and existing stability results for many popular algorithms can be applied to derive positive instances where such confidence intervals will be valid. However, in the common setting where CV is used to compare two algorithms, it becomes necessary to consider a notion of relative stability which cannot easily be derived from existing stability results, even for simple algorithms. To better understand relative stability and when CV provides valid confidence intervals for the test error difference of two algorithms, we study the soft-thresholded least squares algorithm, a close cousin of the Lasso. We prove that while stability holds when assessing the individual test error of this algorithm, relative stability fails to hold when comparing the test error of two such algorithms, even in a sparse low-dimensional linear model setting. Additionally, we empirically confirm the invalidity of CV confidence intervals for the test error difference when either soft-thresholding or the Lasso is used. In short, caution is needed when quantifying the uncertainty of CV estimates of the performance difference of two machine learning algorithms, even when both algorithms are individually stable.
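As a concrete illustration of the setup the abstract describes, the sketch below builds the naive K-fold CV confidence interval for the test error difference of two soft-thresholded least squares fits with different thresholds. The function names (`fit_stls`, `cv_error_difference_ci`) and data-generating choices are illustrative assumptions, not the authors' code; the normal-approximation interval is the kind the paper shows can be miscalibrated.

```python
import numpy as np

def soft_threshold(beta, lam):
    # Componentwise soft-thresholding: shrink by lam, zeroing small entries
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def fit_stls(X, y, lam):
    # Soft-thresholded least squares: fit OLS, then soft-threshold the coefficients
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    return soft_threshold(beta_ols, lam)

def cv_error_difference_ci(X, y, lam1, lam2, K=5, z=1.96, seed=0):
    # Naive K-fold CV confidence interval for the test error difference of
    # two soft-thresholded least squares algorithms (thresholds lam1, lam2)
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    diffs = np.empty(n)
    for fold in folds:
        train = np.ones(n, dtype=bool)
        train[fold] = False
        b1 = fit_stls(X[train], y[train], lam1)
        b2 = fit_stls(X[train], y[train], lam2)
        # Per-observation squared-error differences on the held-out fold
        diffs[fold] = (y[fold] - X[fold] @ b1) ** 2 - (y[fold] - X[fold] @ b2) ** 2
    est = diffs.mean()
    half = z * diffs.std(ddof=1) / np.sqrt(n)  # normal-approximation interval
    return est - half, est + half
```

In a sparse linear model (most true coefficients zero), the paper's result says intervals built this way need not attain their nominal coverage, even though each individual algorithm is stable.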
Problem

Research questions and friction points this paper is trying to address.

Assessing relative stability in cross-validation model comparison
Validating confidence intervals for test error differences
Instability in comparing soft-thresholded least squares algorithms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Studies relative stability in cross-validation comparisons
Analyzes soft-thresholded least squares algorithm
Empirically demonstrates invalidity of CV confidence intervals for error differences
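Since the bullets contrast soft-thresholding with the Lasso, a minimal coordinate-descent Lasso sketch (an illustrative reimplementation, not the authors' code) shows why the abstract calls the two close cousins: for an orthogonal design, the Lasso solution reduces exactly to soft-thresholded least squares.

```python
import numpy as np

def soft_threshold(b, lam):
    # Soft-thresholding operator, applied componentwise
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def fit_lasso_cd(X, y, lam, n_iter=200):
    # Coordinate descent for (1/2n)||y - X beta||^2 + lam * ||beta||_1
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)  # per-coordinate curvature
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with coordinate j removed from the fit
            r = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r, n * lam) / col_ss[j]
    return beta
```

When the design satisfies X^T X = n I, each update becomes beta_j = S(X_j^T y, n lam) / n, i.e., a soft-thresholded least squares coefficient, so in that special case the two algorithms coincide.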