🤖 AI Summary
Addressing the challenge of simultaneously achieving accuracy, interpretability, and robustness in variable selection and prediction on high-dimensional noisy data, this paper proposes a hybrid modeling paradigm that integrates regularization techniques (Lasso, Ridge, and Elastic Net) with nonlinear tree-based ensembles such as Random Forest, XGBoost, and LightGBM. Performance is evaluated systematically on the Friedman simulation benchmark using RMSE, the Jaccard index, and recall. Theoretically, the approach is argued to achieve asymptotic consistency and strong generalization. Empirically, it outperforms both purely regularized and purely black-box models: it improves predictive accuracy while substantially enhancing the consistency and robustness of variable identification, particularly as the sample size grows. The framework thus provides a trustworthy decision-support pathway that reconciles interpretability with statistical stability.
📝 Abstract
This study examines how different types of supervised models perform at prediction and relevant-variable selection in high-dimensional settings, especially when the data are very noisy. We analyzed three approaches: regularized models (Lasso, Ridge, and Elastic Net), black-box models (Random Forest, XGBoost, LightGBM, CatBoost, and H2O GBM), and hybrid models that combine the two by pairing regularization with nonlinear algorithms. Using simulations inspired by the Friedman equation, we evaluated 23 models with three complementary metrics: RMSE, the Jaccard index, and recall. The results show that, although black-box models excel in predictive accuracy, they lack the interpretability and simplicity that are essential in many real-world contexts, while regularized models proved more sensitive to an excess of irrelevant variables. In this scenario, hybrid models stood out for their balance: they maintain good predictive performance, identify relevant variables more consistently, and offer greater robustness, especially as the sample size increases. We therefore recommend this hybrid framework for market applications, where results must make sense in a practical context and support decisions with confidence.
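The hybrid idea described above can be sketched with scikit-learn. This is a minimal illustration, not the paper's actual experimental code: it assumes one particular pairing (Lasso screening followed by a Random Forest) on scikit-learn's Friedman #1 generator, where only the first 5 inputs are truly relevant, and computes the three metrics mentioned in the abstract.

```python
# Hypothetical sketch of a hybrid pipeline: Lasso screens variables,
# then a Random Forest is fit on the surviving features only.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Friedman #1: y depends only on the first 5 of the 20 features.
X, y = make_friedman1(n_samples=500, n_features=20, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Stage 1: sparse regularization selects a candidate variable set.
lasso = Lasso(alpha=0.1).fit(X_tr, y_tr)
selected = np.flatnonzero(lasso.coef_)

# Stage 2: a nonlinear ensemble models only the selected variables.
rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_tr[:, selected], y_tr)

# Metrics: RMSE for prediction; Jaccard and recall for variable recovery
# against the known relevant set {0, ..., 4}.
rmse = mean_squared_error(y_te, rf.predict(X_te[:, selected])) ** 0.5
true_vars, found = set(range(5)), set(selected.tolist())
jaccard = len(true_vars & found) / len(true_vars | found)
recall = len(true_vars & found) / len(true_vars)
print(f"RMSE={rmse:.3f}  Jaccard={jaccard:.2f}  Recall={recall:.2f}")
```

Because the simulation's relevant variables are known by construction, variable-recovery metrics such as the Jaccard index and recall can be computed exactly, which is what makes the Friedman benchmark convenient for comparing selection consistency across model families.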