🤖 AI Summary
Conventional random forests employ bootstrap sampling rates (BR) ≤ 1.0, yet the impact of BR > 1.0 remains underexplored. Method: We systematically investigate BR ∈ [1.2, 5.0] across 36 diverse benchmark datasets, employing rigorous statistical significance tests and tree-structure analysis. Contribution/Results: We provide the first empirical evidence that BR > 1.0 significantly improves classification accuracy—outperforming standard BR = 1.0 on most datasets. Crucially, we find that optimal BR is primarily determined by intrinsic dataset characteristics—not model hyperparameters—enabling us to train a data-feature-based binary classifier that predicts high-gain BR intervals with 81.88%–88.81% accuracy. This work reveals the positive role of controlled oversampling in enhancing ensemble generalization and introduces the first generalizable, data-driven strategy for adaptive BR selection, substantially improving random forest robustness and applicability.
📝 Abstract
Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set ($N$). Research literature indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is $\leq$ 1 or $>$ 1 for a given dataset, achieving between 81.88% and 88.81% accuracy, depending on the experiment configuration.
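The sampling scheme described above is easy to reproduce by hand. Note that scikit-learn's `RandomForestClassifier` caps its `max_samples` parameter at $N$, so BR $>$ 1 requires drawing the bootstrap samples manually. The sketch below (all function names and hyperparameter values are illustrative, not taken from the paper) builds an ensemble of decision trees where each tree is fit on `int(BR * N)` draws with replacement:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def fit_forest(X, y, n_trees=50, bootstrap_rate=2.0, seed=0):
    """Toy random forest allowing bootstrap rate (BR) > 1:
    each tree sees int(bootstrap_rate * N) samples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample_size = int(bootstrap_rate * n)  # BR * N, may exceed N
    trees = []
    for _ in range(n_trees):
        # sampling with replacement; BR > 1 means duplicates are guaranteed
        idx = rng.integers(0, n, size=sample_size)
        tree = DecisionTreeClassifier(
            max_features="sqrt",  # usual random-forest feature subsampling
            random_state=int(rng.integers(1 << 30)),
        )
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees


def predict_forest(trees, X):
    """Majority vote over the component trees (assumes integer class labels)."""
    votes = np.stack([t.predict(X) for t in trees]).astype(int)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

A forest built this way can then be evaluated at several BR values (e.g. 0.6, 1.0, 2.0) on a held-out split to see whether BR $>$ 1 helps on a particular dataset, mirroring the comparison the paper performs across its 36 benchmarks.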