🤖 AI Summary
Sample dependence in network-linked data reduces effective sample size while simultaneously encoding exploitable structural information. Existing methods struggle to balance predictive accuracy and interpretability: graph neural networks (GNNs) achieve high performance but lack transparency, whereas linear network regression is interpretable yet insufficiently accurate.
Method: We propose Interpretable Network-Enhanced Random Forests (RF+), which integrates a neighborhood information-propagation mechanism into random forests and establishes a multi-level framework for quantifying feature importance (both global and local) and sample influence.
Contribution/Results: RF+ achieves GNN-level predictive accuracy on mainstream benchmarks—significantly outperforming conventional interpretable models—while delivering a complete, intuitive, and auditable suite of explanation tools. Its transparency, fidelity, and computational efficiency make it particularly suitable for high-stakes decision-making scenarios requiring both reliability and accountability.
📝 Abstract
Machine learning algorithms often assume that training samples are independent. When data points are connected by a network, the induced dependency between samples is both a challenge, reducing effective sample size, and an opportunity to improve prediction by leveraging information from network neighbors. Multiple methods taking advantage of this opportunity are now available, but many, including graph neural networks, are not easily interpretable, limiting their usefulness for understanding how a model makes its predictions. Others, such as network-assisted linear regression, are interpretable but often yield substantially worse prediction performance. We bridge this gap by proposing a family of flexible network-assisted models built upon a generalization of random forests (RF+), which achieves highly competitive prediction accuracy and can be interpreted through feature importance measures. In particular, we develop a suite of interpretation tools that enable practitioners to not only identify important features that drive model predictions, but also quantify the importance of the network contribution to prediction. Importantly, we provide both global and local importance measures as well as sample influence measures to assess the impact of a given observation. This suite of tools broadens the scope and applicability of network-assisted machine learning for high-impact problems where interpretability and transparency are essential.
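The core idea described above can be sketched in a heavily simplified form: augment each sample's features with its network neighbors' averaged features, fit an off-the-shelf random forest on the augmented design, and use permutation importance to separate the contribution of a sample's own features from the network contribution. This is an illustrative sketch, not the authors' RF+ algorithm; the random graph, the coefficients, and the use of scikit-learn estimators are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n, p = 300, 4

# Assumed random network: symmetric 0/1 adjacency matrix, ~5% edge density
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1)
A = A + A.T

X = rng.normal(size=(n, p))

# One step of neighborhood propagation: row-normalized neighbor averages
deg = A.sum(axis=1, keepdims=True)
deg[deg == 0] = 1.0  # isolated nodes keep a zero neighbor average
X_nbr = (A @ X) / deg

# Synthetic outcome: depends on a sample's own feature 0
# and on its neighbors' average of feature 1 (network signal)
y = 2.0 * X[:, 0] + 3.0 * X_nbr[:, 1] + rng.normal(scale=0.3, size=n)

# Fit a forest on the augmented design [own features | neighbor averages]
Z = np.hstack([X, X_nbr])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z, y)

# Global importance: permutation importance, split into own vs. network blocks
imp = permutation_importance(rf, Z, y, n_repeats=10, random_state=0)
own_imp = imp.importances_mean[:p].sum()
net_imp = imp.importances_mean[p:].sum()
print(f"own-feature importance:     {own_imp:.3f}")
print(f"network-feature importance: {net_imp:.3f}")
```

Because the neighbor averages enter the forest as ordinary columns, any standard importance tool applies to them unchanged, which is what makes the network contribution directly quantifiable in this setup.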