Integrating Random Forests and Generalized Linear Models for Improved Accuracy and Interpretability

📅 2023-07-04

📈 Citations: 11

✨ Influential: 0

🤖 AI Summary

Random forests (RFs) are widely used, but their feature importance measures, such as Mean Decrease in Impurity (MDI), are unstable and hard to interpret scientifically, and RFs can perform poorly when the data have smooth or additive structure. To address these limitations, the authors propose RF+, a framework that combines RFs with generalized linear models (GLMs). RF+ reinterprets each decision tree as a linear regression on engineered features associated with the tree's decision splits, and reinterprets MDI as an $R^2$ value, which yields an improved importance measure called MDI+. Across data-inspired simulations and real-world datasets, RF+ improves prediction accuracy over standard RFs, and MDI+ identifies signal features more accurately than popular importance measures, often by more than 10% over its closest competitor. In case studies on drug response prediction and breast cancer subtyping, MDI+ also selects well-established genes with significantly greater stability than existing importance measures.

📝 Abstract

Random forests (RFs) are among the most popular supervised learning algorithms due to their nonlinear flexibility and ease-of-use. However, as black box models, they can only be interpreted via algorithmically-defined feature importance methods, such as Mean Decrease in Impurity (MDI), which have been observed to be highly unstable and have ambiguous scientific meaning. Furthermore, they can perform poorly in the presence of smooth or additive structure. To address this, we reinterpret decision trees and MDI as linear regression and $R^2$ values, respectively, with respect to engineered features associated with the tree's decision splits. This allows us to combine the respective strengths of RFs and generalized linear models in a framework called RF+, which also yields an improved feature importance method we call MDI+. Through extensive data-inspired simulations and real-world datasets, we show that RF+ improves prediction accuracy over RFs and that MDI+ outperforms popular feature importance measures in identifying signal features, often yielding more than a 10% improvement over its closest competitor. In case studies on drug response prediction and breast cancer subtyping, we further show that MDI+ extracts well-established genes with significantly greater stability compared to existing feature importance measures.
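
The reinterpretation described in the abstract can be illustrated on the smallest possible tree, a single split. The sketch below is not the authors' implementation: the synthetic data, the threshold, and the helper `stump_feature` are all assumptions made for illustration, and the centered indicator is only one simple choice of engineered feature. It shows that ordinary least squares on a split-derived feature reproduces the stump's leaf-mean predictions, and that the $R^2$ of that regression equals the split's impurity decrease divided by the total variance of the response.

```python
import numpy as np

# Synthetic data (assumed for illustration): a jump in feature 0 plus noise.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = 2.0 * (X[:, 0] > 0.3) + 0.5 * rng.normal(size=n)


def stump_feature(x, threshold):
    """Hypothetical helper: centered indicator of falling left of a split."""
    indicator = (x <= threshold).astype(float)
    return indicator - indicator.mean()


threshold = 0.3  # assumed split point, not learned here
left = X[:, 0] <= threshold
right = ~left

# Decision-stump view: predict the mean response in each child node.
stump_pred = np.where(left, y[left].mean(), y[right].mean())

# Linear-regression view: OLS of y on an intercept and the engineered feature.
psi = stump_feature(X[:, 0], threshold)
design = np.column_stack([np.ones(n), psi])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
linear_pred = design @ coef

# R^2 of the linear fit.
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - np.sum((y - linear_pred) ** 2) / sst

# Impurity (variance) decrease of the split, normalized by total variance.
impurity_decrease = y.var() - (
    left.mean() * y[left].var() + right.mean() * y[right].var()
)
normalized_mdi = impurity_decrease / y.var()

print("predictions agree:", np.allclose(stump_pred, linear_pred))
print("R^2 of linear fit:", r2)
print("normalized impurity decrease:", normalized_mdi)
```

Once a split is expressed as a regression feature, the least-squares fit can be swapped for a regularized or generalized linear model, which is the direction the abstract describes for RF+.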

Problem

Research questions and friction points this paper is trying to address.

RF feature importance measures such as MDI are unstable and have ambiguous scientific meaning

RFs can perform poorly in the presence of smooth or additive structure

As black-box models, RFs can only be interpreted via algorithmically defined importance methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

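Reinterprets decision trees as linear regression on engineered features tied to the trees' decision splits, and MDI as the corresponding $R^2$

Introduces RF+, a framework that combines RFs with generalized linear models to improve prediction accuracy

Develops MDI+, a feature importance measure that identifies signal features more accurately and more stably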
