🤖 AI Summary
Random forests (RFs) achieve strong predictive performance on tabular data, but their ensemble nature limits interpretability, which hinders trustworthy deployment in high-stakes domains such as healthcare. To address this, we propose Forest-Guided Clustering (FGC), a model-specific explainability method that groups instances by the decision paths they share within an RF, capturing both local decision logic and global structural patterns. FGC produces human-interpretable clusters and derives both cluster-specific and global feature importances, going beyond the feature-level attribution offered by post-hoc explanation techniques. On a benchmark dataset, FGC accurately recovers latent subclass structure and outperforms classical clustering and post-hoc explanation methods. Applied to acute myeloid leukemia (AML) transcriptomic data, it isolates biologically coherent subpopulations and disentangles disease-relevant signals from confounders, supporting its interpretability and practical utility.
📝 Abstract
As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has grown. Random Forests (RFs), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
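To make "grouping instances according to shared decision paths" concrete, the following is a minimal sketch and not the authors' implementation. It assumes that path sharing is measured by the standard RF proximity (the fraction of trees in which two samples reach the same leaf) and that clustering is done with a plain k-medoids on the resulting distances; the abstract confirms neither choice. The `leaves` matrix is hypothetical toy data; in practice it could come from something like scikit-learn's `RandomForestClassifier.apply(X)`, which returns one leaf index per sample and tree.

```python
from itertools import combinations


def proximity_matrix(leaves):
    """leaves[i][t] is the leaf index that sample i reaches in tree t."""
    n, n_trees = len(leaves), len(leaves[0])
    prox = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        # Count the trees in which both samples end in the same leaf.
        shared = sum(a == b for a, b in zip(leaves[i], leaves[j]))
        prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix (assumed choice)."""
    n = len(dist)
    medoids = list(range(k))  # naive deterministic initialisation
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each sample joins its closest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        # Update step: the new medoid minimises total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster ends up empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels, medoids


# Hypothetical leaf assignments: 6 samples, 4 trees.
leaves = [
    [0, 1, 0, 2],
    [0, 1, 0, 3],
    [0, 1, 1, 2],
    [5, 4, 3, 7],
    [5, 4, 3, 6],
    [5, 4, 2, 7],
]
prox = proximity_matrix(leaves)
dist = [[1.0 - p for p in row] for row in prox]
labels, medoids = k_medoids(dist, k=2)
print(labels)
```

On this toy input the first three samples share most of their leaves with each other and none with the last three, so the two groups end up in separate clusters. Because the distances come from the forest rather than from raw feature space, the clusters reflect how the model partitions the data, which is the property the abstract attributes to FGC.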