🤖 AI Summary
Hyperparameter optimization (HPO) methods suffer from a lack of systematic, reproducible evaluation benchmarks, hindering fair and rigorous comparative analysis.
Method: This paper introduces the first standardized HPO benchmark framework supporting four task categories: black-box, multi-fidelity, multi-objective, and multi-fidelity multi-objective optimization. It proposes a star-discrepancy minimization algorithm for selecting representative task subsets, yielding compact, high-coverage sets of 10–30 tasks per category that can be recomputed as new benchmarks are added. The framework integrates 3,336 tasks from five mainstream benchmark suites and 28 optimizer variants, offering a lightweight unified API and a fully automated analysis pipeline.
Contribution/Results: The paper releases the first open-source, reproducible repository of HPO baseline results. The framework significantly improves the efficiency and rigor of HPO method prototyping, performance comparison, and reproducibility-aware evaluation.
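The "lightweight unified API" mentioned above glues any of the N optimizers to any of the M benchmark tasks. A minimal sketch of what such a glue interface might look like is shown below, using an ask/tell loop; all class and attribute names here are hypothetical illustrations, not the real carps API.

```python
import random
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional

# All names below are hypothetical, chosen to illustrate the general shape
# of a glue interface between optimizers and tasks -- not the carps API.

@dataclass
class TrialInfo:
    config: dict                    # hyperparameter configuration to evaluate
    budget: Optional[float] = None  # fidelity, e.g. epochs (None for black-box)

@dataclass
class TrialValue:
    cost: float                     # objective value to minimize

class Task(ABC):
    @abstractmethod
    def evaluate(self, info: TrialInfo) -> TrialValue: ...

class Optimizer(ABC):
    @abstractmethod
    def ask(self) -> TrialInfo: ...
    @abstractmethod
    def tell(self, info: TrialInfo, value: TrialValue) -> None: ...

class Quadratic(Task):
    """Toy black-box task: minimize (x - 0.3)^2."""
    def evaluate(self, info: TrialInfo) -> TrialValue:
        x = info.config["x"]
        return TrialValue(cost=(x - 0.3) ** 2)

class RandomSearch(Optimizer):
    """Toy optimizer: sample x uniformly, track the best cost seen."""
    def __init__(self) -> None:
        self.best = float("inf")
    def ask(self) -> TrialInfo:
        return TrialInfo(config={"x": random.uniform(-1.0, 1.0)})
    def tell(self, info: TrialInfo, value: TrialValue) -> None:
        self.best = min(self.best, value.cost)

def run(optimizer: Optimizer, task: Task, n_trials: int) -> None:
    # The entire "glue": any optimizer drives any task through ask/tell.
    for _ in range(n_trials):
        info = optimizer.ask()
        optimizer.tell(info, task.evaluate(info))
```

Because the loop only touches the two abstract interfaces, adding a new optimizer or benchmark suite reduces to writing one adapter class, which is what makes an N-optimizers-times-M-tasks study tractable.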
📝 Abstract
Hyperparameter Optimization (HPO) is crucial for developing well-performing machine learning models. To ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies that allows evaluating N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important HPO task types: blackbox, multi-fidelity, multi-objective, and multi-fidelity-multi-objective. With 3,336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the largest go-to library to date for evaluating and comparing HPO methods. The carps framework relies on a purpose-built, lightweight interface gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating such a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (https://www.github.com/automl/CARP-S), we make an important step in the standardization of HPO evaluation.
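The subset-selection idea in the abstract can be sketched concretely. The star discrepancy of a point set P in the unit cube measures the worst-case gap between the fraction of points falling in an anchored box [0, a) and that box's volume; a low-discrepancy subset therefore covers the feature space evenly. Below is a minimal, self-contained sketch, assuming task-describing feature vectors and a greedy selection strategy with a Monte Carlo discrepancy approximation; the paper's actual algorithm and feature representation may differ.

```python
import numpy as np

def approx_star_discrepancy(points: np.ndarray, anchors: np.ndarray) -> float:
    """Approximate D*(P) = sup_a |#{p <= a}/n - vol([0, a))| over a finite
    sample of anchor corners `anchors` in [0, 1]^d."""
    disc = 0.0
    for a in anchors:
        frac_inside = np.mean(np.all(points <= a, axis=1))  # empirical measure
        disc = max(disc, abs(frac_inside - np.prod(a)))      # vs. box volume
    return float(disc)

def select_representative_subset(features: np.ndarray, k: int,
                                 n_anchors: int = 256, seed: int = 0) -> list:
    """Greedily pick k tasks whose feature vectors have low approximate star
    discrepancy in the unit cube spanned by the full task set (a sketch, not
    the paper's exact algorithm)."""
    rng = np.random.default_rng(seed)
    # Rescale features to [0, 1]^d -- "the space spanned by the full set".
    lo, hi = features.min(axis=0), features.max(axis=0)
    pts = (features - lo) / np.where(hi > lo, hi - lo, 1.0)
    anchors = rng.random((n_anchors, pts.shape[1]))  # Monte Carlo test boxes
    chosen: list = []
    remaining = list(range(len(pts)))
    for _ in range(min(k, len(pts))):
        # Add the task that keeps the subset's discrepancy smallest.
        best_i, best_d = remaining[0], np.inf
        for i in remaining:
            d = approx_star_discrepancy(pts[chosen + [i]], anchors)
            if d < best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
        remaining.remove(best_i)
    return chosen
```

The same routine supports the "re-compute subsets as more benchmarks become available" workflow: rerunning it on an enlarged feature matrix yields a fresh representative subset.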