carps: A Framework for Comparing N Hyperparameter Optimizers on M Benchmarks

📅 2025-06-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Hyperparameter optimization (HPO) methods lack systematic, reproducible evaluation benchmarks, which hinders fair and rigorous comparative analysis. Method: This paper introduces carps, a standardized HPO benchmark framework supporting four task types: black-box, multi-fidelity, multi-objective, and multi-fidelity multi-objective optimization. It selects representative task subsets by minimizing the star discrepancy of the subset in the space spanned by the full task set, yielding compact, high-coverage sets of 10–30 tasks per type that can be recomputed as new benchmarks are added. The framework integrates 3,336 tasks from five community benchmark collections and 28 variants of 9 optimizer families behind a lightweight unified interface, together with a fully automated analysis pipeline. Contribution/Results: The authors release open-source, reproducible baseline results on the proposed task subsets, improving the efficiency and rigor of HPO method prototyping, performance comparison, and evaluation.

📝 Abstract
Hyperparameter Optimization (HPO) is crucial to develop well-performing machine learning models. In order to ease prototyping and benchmarking of HPO methods, we propose carps, a benchmark framework for Comprehensive Automated Research Performance Studies that allows evaluating N optimizers on M benchmark tasks. In this first release of carps, we focus on the four most important HPO task types: blackbox, multi-fidelity, multi-objective and multi-fidelity-multi-objective. With 3,336 tasks from 5 community benchmark collections and 28 variants of 9 optimizer families, we offer the biggest go-to library to date to evaluate and compare HPO methods. The carps framework relies on a purpose-built, lightweight interface, gluing together optimizers and benchmark tasks. It also features an analysis pipeline, facilitating the evaluation of optimizers on benchmarks. However, navigating a huge number of tasks while developing and comparing methods can be computationally infeasible. To address this, we obtain a subset of representative tasks by minimizing the star discrepancy of the subset, in the space spanned by the full set. As a result, we propose an initial subset of 10 to 30 diverse tasks for each task type, and include functionality to re-compute subsets as more benchmarks become available, enabling efficient evaluations. We also establish a first set of baseline results on these tasks as a measure for future comparisons. With carps (https://www.github.com/automl/CARP-S), we make an important step in the standardization of HPO evaluation.
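The "lightweight interface gluing together optimizers and benchmark tasks" can be illustrated with a minimal ask/tell-style loop. This is a hypothetical sketch, not the actual carps API: the names `TrialInfo`, `TrialValue`, `ask`, `tell`, and the toy `QuadraticTask` and `RandomSearch` classes are all illustrative assumptions.

```python
from dataclasses import dataclass
import random


@dataclass
class TrialInfo:
    """A suggested configuration to evaluate (illustrative name)."""
    config: dict


@dataclass
class TrialValue:
    """The result of evaluating a configuration (illustrative name)."""
    cost: float


class QuadraticTask:
    """Toy benchmark task: minimize (x - 0.3)^2 over x in [0, 1]."""

    def evaluate(self, trial: TrialInfo) -> TrialValue:
        x = trial.config["x"]
        return TrialValue(cost=(x - 0.3) ** 2)


class RandomSearch:
    """Minimal ask/tell optimizer: samples x uniformly, remembers the best."""

    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)
        self.best = None

    def ask(self) -> TrialInfo:
        return TrialInfo(config={"x": self.rng.random()})

    def tell(self, info: TrialInfo, value: TrialValue) -> None:
        if self.best is None or value.cost < self.best[1].cost:
            self.best = (info, value)


def run(optimizer, task, n_trials: int = 50):
    """Glue loop: any optimizer with ask/tell runs on any task with evaluate."""
    for _ in range(n_trials):
        info = optimizer.ask()
        value = task.evaluate(info)
        optimizer.tell(info, value)
    return optimizer.best
```

Because the loop only depends on the `ask`/`tell` and `evaluate` methods, any of the N optimizers can be paired with any of the M tasks without either side knowing about the other, which is the core of such an interface design.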
Problem

Research questions and friction points this paper is trying to address.

Framework to compare N hyperparameter optimizers on M benchmarks
Standardize evaluation of HPO methods across diverse task types
Provide representative task subsets for efficient HPO benchmarking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for comparing N optimizers on M benchmarks
Lightweight interface gluing optimizers and tasks
Subset selection via star discrepancy minimization
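The subset-selection idea above can be sketched as a greedy loop that grows a subset while keeping an estimate of its star discrepancy small. This is a simplified illustration, not the algorithm from the paper: exact star-discrepancy computation is NP-hard, so the sketch evaluates the local discrepancy only at axis-aligned boxes anchored at coordinates occurring in the point set (a standard grid approximation), and the function names are illustrative. Tasks are assumed to be embedded as points in the unit cube.

```python
import itertools


def star_discrepancy(points):
    """Approximate star discrepancy of points in [0, 1]^d.

    Checks boxes [0, c) whose corners c are built from coordinates
    that occur in the point set (plus 1.0), and returns the largest
    gap between the empirical fraction inside and the box volume.
    """
    n = len(points)
    d = len(points[0])
    grids = [sorted({p[k] for p in points} | {1.0}) for k in range(d)]
    worst = 0.0
    for corner in itertools.product(*grids):
        volume = 1.0
        for c in corner:
            volume *= c
        inside = sum(all(p[k] < corner[k] for k in range(d)) for p in points)
        worst = max(worst, abs(inside / n - volume))
    return worst


def greedy_subset(points, size):
    """Greedily add the point that keeps the subset's discrepancy lowest."""
    subset, remaining = [], list(points)
    while len(subset) < size and remaining:
        best = min(remaining, key=lambda p: star_discrepancy(subset + [p]))
        subset.append(best)
        remaining.remove(best)
    return subset
```

A low star discrepancy means the chosen tasks cover the space spanned by the full task set evenly, which is why a 10-30 task subset can stand in for thousands of tasks during method development.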