🤖 AI Summary
Conventional random forests employ bootstrap sampling rates (BR) ≤ 1.0, yet the impact of BR > 1.0 remains underexplored. Method: We systematically investigate BR ∈ [1.2, 5.0] across 36 diverse benchmark datasets, employing rigorous statistical significance tests and tree-structure analysis. Contribution/Results: We provide the first empirical evidence that BR > 1.0 significantly improves classification accuracy—outperforming standard BR = 1.0 on most datasets. Crucially, we find that optimal BR is primarily determined by intrinsic dataset characteristics—not model hyperparameters—enabling us to train a data-feature-based binary classifier that predicts high-gain BR intervals with 81.88%–88.81% accuracy. This work reveals the positive role of controlled oversampling in enhancing ensemble generalization and introduces the first generalizable, data-driven strategy for adaptive BR selection, substantially improving random forest robustness and applicability.
📝 Abstract
Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set ($N$). Research literature indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is $\leq$ 1 or $>$ 1 for a given dataset, achieving between 81.88% and 88.81% accuracy, depending on the experiment configuration.
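The sampling scheme described above is easy to reproduce by hand. Note that scikit-learn's `RandomForestClassifier` caps its `max_samples` parameter at $N$, so BR $>$ 1 requires drawing the bootstrap samples manually. The sketch below (all function names and hyperparameter values are illustrative, not taken from the paper) builds an ensemble of decision trees where each tree is fit on `int(BR * N)` draws with replacement:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=50, bootstrap_rate=2.0, seed=0):
    """Toy random forest allowing bootstrap rate (BR) > 1:
    each tree sees int(bootstrap_rate * N) samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample_size = int(bootstrap_rate * n)  # BR * N, may exceed N
    trees = []
    for _ in range(n_trees):
        # sampling with replacement; BR > 1 means duplicates are guaranteed
        idx = rng.integers(0, n, size=sample_size)
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # usual random-forest feature subsampling
            random_state=int(rng.integers(1 << 30)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Majority vote over the component trees (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

A forest built this way can then be evaluated at several BR values (e.g. 0.6, 1.0, 2.0) on a held-out split to see whether BR $>$ 1 helps on a particular dataset, mirroring the comparison the paper performs across its 36 benchmarks.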