🤖 AI Summary
This study addresses the critical dependence of tree-boosting models' generalization performance on hyperparameter configuration, an area that has lacked systematic comparison of tuning methods and guiding principles. In a large-scale empirical evaluation across 59 regression and classification datasets under a unified experimental framework, the authors compare prominent hyperparameter optimization approaches, including random grid search, deterministic full grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization, Hyperband, and SMAC. The findings show that SMAC consistently outperforms the other methods across most tasks, while default hyperparameters yield clearly suboptimal results. Notably, all considered hyperparameters can materially influence performance, contradicting the common assumption that only a few are critical. Effective tuning requires a large number of trials (more than 100), and for regression tasks, choosing the number of boosting iterations via early stopping outperforms including it in the search space.
📝 Abstract
Tree-boosting is a widely used machine learning technique for tabular data. However, its out-of-sample accuracy is critically dependent on multiple hyperparameters. In this article, we empirically compare several popular methods for hyperparameter optimization for tree-boosting, including random grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization (GP-BO), Hyperband, the sequential model-based algorithm configuration (SMAC) method, and deterministic full grid search, using $59$ regression and classification data sets. We find that the SMAC method clearly outperforms all the other considered methods. We further observe that (i) a relatively large number of trials (more than $100$) is required for accurate tuning, (ii) using default values for hyperparameters yields very inaccurate models, (iii) all considered hyperparameters can have a material effect on the accuracy of tree-boosting, i.e., there is no small set of hyperparameters that is more important than others, and (iv) choosing the number of boosting iterations using early stopping yields more accurate results compared to including it in the search space for regression tasks.
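To make two of the compared ideas concrete, here is a minimal pure-Python sketch of (a) random search over a boosting-style hyperparameter space and (b) choosing the number of boosting iterations by early stopping on a validation curve. The parameter names and the toy search space are illustrative assumptions (styled after common boosting libraries), not the paper's actual experimental setup.

```python
import random

# Hypothetical search space with parameter names typical of tree-boosting
# libraries; the space actually used in the paper may differ.
SPACE = {
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, 0),
    "max_depth": lambda rng: rng.randint(1, 10),
    "min_child_samples": lambda rng: rng.randint(1, 100),
    "lambda_l2": lambda rng: 10 ** rng.uniform(-3, 2),
}


def random_search(objective, n_trials=200, seed=0):
    """Sample n_trials configurations uniformly from SPACE, keep the best."""
    rng = random.Random(seed)
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: draw(rng) for name, draw in SPACE.items()}
        loss = objective(cfg)  # e.g., cross-validated test error
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss


def early_stopping_iteration(val_losses, patience=10):
    """Pick the number of boosting iterations by early stopping: stop once
    the validation loss fails to improve for `patience` consecutive rounds,
    and return the best iteration seen so far."""
    best, best_iter, wait = float("inf"), 0, 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_iter, wait = loss, i, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_iter
```

With a fixed seed, a larger trial budget can only match or improve the best loss found by a smaller one, which mirrors finding (i) that budgets above $100$ trials help; `early_stopping_iteration` illustrates finding (iv), where the iteration count is determined on a validation set instead of being tuned as a search-space dimension.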