Statistical Inference for Gradient Boosting Regression

📅 2025-09-27
🤖 AI Summary
Gradient boosting regression achieves high predictive accuracy but lacks rigorous statistical inference and uncertainty quantification. To address this, we propose the first unified framework integrating dropout regularization, parallel tree training, and asymptotic theory grounded in the central limit theorem to enable valid statistical inference. Our method supports confidence intervals for parameters, prediction intervals, and significance testing for variable importance. Theoretically, we show that increasing the dropout rate and the number of parallel trees, within appropriate ranges, enhances signal recovery. Empirically, the approach maintains competitive prediction accuracy while substantially improving variable selection consistency and the robustness of uncertainty estimates. Crucially, it bridges the longstanding gap between gradient boosting's empirical success and principled, interpretable statistical inference.
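The training scheme the summary describes (several trees grown per iteration, with dropout applied to the existing ensemble) can be sketched roughly as follows. Everything here is an illustrative assumption, not the paper's implementation: the weak learner is a depth-1 stump under squared-error loss, and the particular dropout and scaling choices are mine.

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-1 regression tree (stump) to residuals r by exhaustive search."""
    n, p = X.shape
    best = (np.inf, 0, 0.0, float(r.mean()), float(r.mean()))
    for j in range(p):
        order = np.argsort(X[:, j])
        xs, rs = X[order, j], r[order]
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue
            lm, rm = rs[:i].mean(), rs[i:].mean()
            sse = ((rs[:i] - lm) ** 2).sum() + ((rs[i:] - rm) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, (xs[i] + xs[i - 1]) / 2.0, lm, rm)
    return best[1:]  # (feature, threshold, left_value, right_value)

def predict_stump(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)

def boost_dropout_parallel(X, y, n_iter=50, m=4, lr=0.1, dropout=0.0, seed=0):
    """Boosting with m trees grown per iteration and dropout over past trees."""
    rng = np.random.default_rng(seed)
    trees, scale = [], lr / m
    for _ in range(n_iter):
        # Dropout: residuals are computed against a random subset of past trees.
        keep = [t for t in trees if rng.random() > dropout]
        pred = sum(predict_stump(t, X) for t in keep) * scale if keep else 0.0
        r = y - pred
        # Parallel step: m trees are fit to the same residuals this iteration.
        trees.extend(fit_stump(X, r) for _ in range(m))
    return trees, scale

def predict_boost(trees, scale, X):
    return sum(predict_stump(t, X) for t in trees) * scale
```

With `dropout=0` and `m=1` this reduces to plain gradient boosting; raising `m` averages several trees per residual step, and raising `dropout` randomizes which past trees each new tree corrects.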

📝 Abstract
Gradient boosting is widely popular due to its flexibility and predictive accuracy. However, statistical inference and uncertainty quantification for gradient boosting remain challenging and under-explored. We propose a unified framework for statistical inference in gradient boosting regression. Our framework integrates dropout or parallel training with a recently proposed regularization procedure that allows for a central limit theorem (CLT) for boosting. With these enhancements, we surprisingly find that increasing the dropout rate and the number of trees grown in parallel at each iteration substantially enhances signal recovery and overall performance. Our resulting algorithms enjoy similar CLTs, which we use to construct built-in confidence intervals, prediction intervals, and rigorous hypothesis tests for assessing variable importance. Numerical experiments demonstrate that our algorithms perform well, interpolate between regularized boosting and random forests, and confirm the validity of their built-in statistical inference procedures.
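The CLT-based inference the abstract mentions can be illustrated with generic normal-approximation machinery: a confidence interval from per-tree predictions at a query point, and a z-test for variable importance. The averaging scheme and plug-in standard error below are assumptions for illustration, not the paper's estimators.

```python
import numpy as np
from math import erf, sqrt

def normal_ci(per_tree_preds, level=0.95):
    """Normal-approximation confidence interval for an ensemble-mean prediction.

    per_tree_preds: shape-(B,) array, one prediction per tree/replicate at a
    fixed query point. Relies on a CLT for the average with a plug-in SE.
    """
    z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}[level]
    B = len(per_tree_preds)
    mean = float(np.mean(per_tree_preds))
    se = float(np.std(per_tree_preds, ddof=1)) / sqrt(B)
    return mean - z * se, mean + z * se

def importance_z_test(imp_per_tree, null_value=0.0):
    """Two-sided z-test that the mean per-tree importance equals null_value."""
    B = len(imp_per_tree)
    se = float(np.std(imp_per_tree, ddof=1)) / sqrt(B)
    z = (float(np.mean(imp_per_tree)) - null_value) / se
    p = 1.0 - erf(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p
```

A small p-value from `importance_z_test` would indicate the variable's importance is distinguishable from the null value under the normal approximation.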
Problem

Research questions and friction points this paper is trying to address.

Develop statistical inference framework for gradient boosting regression
Enable uncertainty quantification via dropout and parallel training
Construct confidence intervals and hypothesis tests for variable importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dropout and parallel training enhance gradient boosting
Central limit theorem enables built-in confidence intervals
Algorithms interpolate between boosting and random forests
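The interpolation claim above can be pictured with a toy knob family: rounds of sequential residual fitting, each round averaging several trees. The `fit_tree` callback and the whole scheme are my illustration of the idea, not the paper's algorithm.

```python
import numpy as np

def rounds_of_trees(X, y, fit_tree, n_rounds, trees_per_round, lr):
    """Interpolate between boosting and forest-style averaging with two knobs.

    n_rounds large, trees_per_round = 1  -> sequential gradient boosting
    n_rounds = 1, trees_per_round large  -> averaging many trees fit to raw targets
    fit_tree(X, r) returns a fitted predictor (a hypothetical callback).
    """
    pred = np.zeros(len(y))
    for _ in range(n_rounds):
        residual = y - pred  # negative gradient under squared-error loss
        stage = np.mean(
            [fit_tree(X, residual)(X) for _ in range(trees_per_round)], axis=0
        )
        pred = pred + lr * stage
    return pred
```

Both extremes of the knob family converge to the same target here; what changes is whether variance reduction comes from sequential correction or from averaging within a round.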
Haimo Fang
School of Economics, Fudan University
Kevin Tan
Department of Statistics and Data Science, The Wharton School, University of Pennsylvania
Giles Hooker
Professor of Statistics and Data Science, University of Pennsylvania
Statistics · Machine Learning · Dynamical Systems