Overtuning in Hyperparameter Optimization

📅 2025-06-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper identifies and formalizes "overtuning", a previously overlooked form of overfitting in hyperparameter optimization (HPO), where excessive optimization of stochastic, resampling-based validation error estimates leads to selected configurations that generalize worse on unseen test data. Method: The authors formally define overtuning, distinguish it from related concepts such as meta-overfitting, and conduct a large-scale reanalysis of HPO benchmark data spanning diverse learning algorithms, HPO methods, and resampling strategies (holdout, cross-validation). Contribution/Results: Overtuning is more common than previously assumed, typically mild but occasionally severe, and especially prevalent in small-sample regimes: in roughly 10% of cases, the HPO-selected configuration has a worse generalization error than the default or first configuration tried. These findings expose the unreliability of the validation error as a proxy for generalization and identify a concrete threat to AutoML robustness; the paper further analyzes contributing factors and discusses mitigation strategies for designing HPO protocols resilient to this form of overfitting.

๐Ÿ“ Abstract
Hyperparameter optimization (HPO) aims to identify an optimal hyperparameter configuration (HPC) such that the resulting model generalizes well to unseen data. As the expected generalization error cannot be optimized directly, it is estimated with a resampling strategy, such as holdout or cross-validation. This approach implicitly assumes that minimizing the validation error leads to improved generalization. However, since validation error estimates are inherently stochastic and depend on the resampling strategy, a natural question arises: Can excessive optimization of the validation error lead to overfitting at the HPO level, akin to overfitting in model training based on empirical risk minimization? In this paper, we investigate this phenomenon, which we term overtuning, a form of overfitting specific to HPO. Despite its practical relevance, overtuning has received limited attention in the HPO and AutoML literature. We provide a formal definition of overtuning and distinguish it from related concepts such as meta-overfitting. We then conduct a large-scale reanalysis of HPO benchmark data to assess the prevalence and severity of overtuning. Our results show that overtuning is more common than previously assumed, typically mild but occasionally severe. In approximately 10% of cases, overtuning leads to the selection of a seemingly optimal HPC with worse generalization error than the default or first configuration tried. We further analyze how factors such as performance metric, resampling strategy, dataset size, learning algorithm, and HPO method affect overtuning and discuss mitigation strategies. Our results highlight the need to raise awareness of overtuning, particularly in the small-data regime, indicating that further mitigation strategies should be studied.
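The failure mode the abstract describes can be illustrated with a toy simulation (a minimal sketch, not from the paper; all error values, noise levels, and counts below are made-up illustrative numbers): each configuration has a fixed true generalization error, the HPO loop only observes a noisy validation estimate, and picking the validation minimizer across many candidates can land on a configuration whose true error exceeds that of the first (default) configuration tried.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_overtuning(n_configs=200, val_noise=0.05, n_reps=2000):
    """Toy model of overtuning: HPO sees only a noisy validation
    estimate of each configuration's true error (as with a small
    holdout set) and selects the validation minimizer."""
    overtuned = 0
    for _ in range(n_reps):
        # True (test) errors of the candidate configurations -- illustrative range.
        true_err = rng.uniform(0.20, 0.30, size=n_configs)
        # Validation estimates: true error plus resampling noise.
        val_err = true_err + rng.normal(0.0, val_noise, size=n_configs)
        chosen = np.argmin(val_err)   # HPO picks the validation minimizer
        default = 0                   # first configuration tried
        if true_err[chosen] > true_err[default]:
            overtuned += 1
    return overtuned / n_reps

print(f"fraction of runs where the selected HPC generalizes worse "
      f"than the default: {simulate_overtuning():.2f}")
```

The sketch reproduces the qualitative effect only: the more candidates evaluated against the same noisy validation signal, the more the selected minimum reflects favorable noise rather than genuinely lower true error, which is why the paper associates overtuning with small-data (high-variance) regimes.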
Problem

Research questions and friction points this paper is trying to address.

Investigates overtuning as HPO overfitting on validation error
Assesses prevalence and severity of overtuning in benchmarks
Analyzes factors influencing overtuning and suggests mitigations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Investigates overtuning in hyperparameter optimization
Formally defines and distinguishes overtuning phenomenon
Analyzes overtuning prevalence via benchmark reanalysis