Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box

📅 2025-07-25

🤖 AI Summary

Random forests (RFs) achieve strong predictive performance on tabular data, but their ensemble nature makes them hard to interpret, which hinders trustworthy deployment in high-stakes domains such as healthcare. To address this, we propose Forest-Guided Clustering (FGC), a model-specific explainability method that groups instances by the decision paths they share within an RF, so that the resulting clusters reflect both local decision logic and global structure. FGC produces human-interpretable clusters and computes both cluster-specific and global feature importance scores, from which the decision rules underlying the RF's predictions can be derived. On a benchmark dataset, FGC accurately recovered latent subclass structure and outperformed classical clustering and post-hoc explanation methods. Applied to acute myeloid leukemia (AML) transcriptomic data, it uncovered biologically coherent subpopulations and separated disease-relevant signals from confounders.

📝 Abstract

As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
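To make "grouping instances according to shared decision paths" concrete, here is a minimal sketch. It is not the authors' implementation. It assumes each instance is described by the leaf it reaches in every tree (scikit-learn's `RandomForestClassifier.apply` returns such a matrix), and it relies on the fact that two instances landing in the same leaf of a tree followed the same root-to-leaf path in that tree. The function names, the toy leaf ids, and the threshold-based grouping are all illustrative assumptions; the abstract does not specify the clustering algorithm FGC uses.

```python
def proximity_matrix(leaves):
    """leaves[i][t] is the leaf id reached by instance i in tree t.

    Returns an n x n matrix whose (i, j) entry is the fraction of trees
    in which instances i and j end up in the same leaf.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = sum(1 for a, b in zip(leaves[i], leaves[j]) if a == b)
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def group_by_proximity(prox, threshold=0.5):
    """Stand-in clustering step (an assumption, not FGC's algorithm):
    connected components of the graph that links every pair of
    instances whose proximity is at least `threshold`.
    """
    n = len(prox)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and prox[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical leaf assignments: 4 instances, 3 trees.
leaves = [
    [1, 4, 2],
    [1, 4, 3],
    [5, 6, 7],
    [5, 6, 7],
]
prox = proximity_matrix(leaves)
labels = group_by_proximity(prox, threshold=0.5)
```

In this toy example the first two instances share a leaf in two of three trees and the last two share a leaf in every tree, so the sketch places them in two separate groups. Any clustering method that accepts a precomputed distance such as `1 - proximity` could replace the thresholding step.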

Problem

Research questions and friction points this paper is trying to address.

Enhancing interpretability of Random Forests for sensitive applications

Grouping instances by shared decision paths in RFs

Uncovering latent structures and feature importance in RFs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Grouping instances by shared decision paths

Computing cluster-specific feature importance scores

Uncovering biologically coherent subpopulations in data
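As a rough illustration of what a cluster-specific feature importance score could look like, the sketch below rates each feature by how far its mean within a cluster lies from the overall mean, in units of the overall standard deviation, and averages those ratings across clusters for a global score. This scoring rule is an assumption chosen for simplicity; the summary above does not state how FGC computes its scores, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev


def cluster_feature_scores(X, labels):
    """Return {cluster: [score per feature]}.

    Score = |cluster mean - overall mean| / overall std for each feature.
    This is an illustrative stand-in, not the scoring used by FGC.
    """
    n_features = len(X[0])
    columns = [[row[f] for row in X] for f in range(n_features)]
    overall_mean = [mean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid division by zero.
    overall_std = [pstdev(col) or 1.0 for col in columns]
    scores = {}
    for c in sorted(set(labels)):
        members = [row for row, lab in zip(X, labels) if lab == c]
        scores[c] = [
            abs(mean(r[f] for r in members) - overall_mean[f]) / overall_std[f]
            for f in range(n_features)
        ]
    return scores


def global_feature_scores(scores):
    """Average the per-cluster scores to get one score per feature."""
    return [mean(col) for col in zip(*scores.values())]


# Toy data: feature 0 separates the two clusters, feature 1 does not.
X = [[0.0, 5.0], [0.2, 4.0], [3.0, 4.0], [3.2, 5.0]]
labels = [0, 0, 1, 1]
scores = cluster_feature_scores(X, labels)
global_scores = global_feature_scores(scores)
```

Here feature 0 receives a high score in both clusters while feature 1 scores zero, which is the kind of contrast a cluster-level importance measure is meant to surface.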
