Forest-Guided Clustering -- Shedding Light into the Random Forest Black Box

📅 2025-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Random forests (RFs) achieve strong predictive performance on tabular data, but their ensemble nature makes them hard to interpret, which hinders trustworthy deployment in high-stakes domains such as healthcare. Forest-Guided Clustering (FGC) addresses this by grouping instances that share decision paths within an RF, using the model's own structure as the clustering criterion so that both local decision logic and global structural patterns become visible. FGC produces clusters aligned with the model's internal logic and derives cluster-specific and global feature importance scores, going beyond the feature-level attribution offered by post-hoc explanation techniques. On a benchmark dataset, FGC accurately recovered latent subclass structure and outperformed classical clustering and post-hoc explanation methods. Applied to acute myeloid leukemia (AML) transcriptomic data, it uncovered biologically coherent subpopulations and disentangled disease-relevant signals from confounders.

📝 Abstract
As machine learning models are increasingly deployed in sensitive application areas, the demand for interpretable and trustworthy decision-making has increased. Random Forests (RF), despite their widespread use and strong performance on tabular data, remain difficult to interpret due to their ensemble nature. We present Forest-Guided Clustering (FGC), a model-specific explainability method that reveals both local and global structure in RFs by grouping instances according to shared decision paths. FGC produces human-interpretable clusters aligned with the model's internal logic and computes cluster-specific and global feature importance scores to derive decision rules underlying RF predictions. FGC accurately recovered latent subclass structure on a benchmark dataset and outperformed classical clustering and post-hoc explanation methods. Applied to an AML transcriptomic dataset, FGC uncovered biologically coherent subpopulations, disentangled disease-relevant signals from confounders, and recovered known and novel gene expression patterns. FGC bridges the gap between performance and interpretability by providing structure-aware insights that go beyond feature-level attribution.
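The idea of grouping instances by shared decision paths can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "shared decision path" can be approximated by two samples landing in the same leaf of a tree, builds a proximity matrix from that, and swaps in average-linkage hierarchical clustering from SciPy as the clustering step. The helper names `forest_proximity` and `cluster_by_shared_paths`, the synthetic dataset, and all parameter values are hypothetical choices for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def forest_proximity(forest, X):
    """Fraction of trees in which each pair of samples lands in the same leaf."""
    leaves = forest.apply(X)  # shape (n_samples, n_trees): leaf index per tree
    n_samples, n_trees = leaves.shape
    proximity = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        column = leaves[:, t]
        # Same leaf in tree t means the two samples followed the same path in it.
        proximity += column[:, None] == column[None, :]
    return proximity / n_trees


def cluster_by_shared_paths(forest, X, n_clusters):
    """Assumed clustering step: hierarchical clustering on 1 - proximity."""
    distance = 1.0 - forest_proximity(forest, X)
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    merge_tree = linkage(condensed, method="average")
    # Labels start at 1; 'maxclust' yields at most n_clusters groups.
    return fcluster(merge_tree, t=n_clusters, criterion="maxclust")


# Synthetic data stands in for a real tabular dataset.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)
forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=0
).fit(X, y)

proximity = forest_proximity(forest, X)
labels = cluster_by_shared_paths(forest, X, n_clusters=3)
```

Because the proximity matrix is derived from the trained forest rather than from raw feature distances, the resulting groups reflect how the model partitions the data, which is the property the abstract emphasizes.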
Problem

Research questions and friction points this paper is trying to address.

Enhancing interpretability of Random Forests for sensitive applications
Grouping instances by shared decision paths in RFs
Uncovering latent structures and feature importance in RFs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Grouping instances by shared decision paths
Computing cluster-specific feature importance scores
Uncovering biologically coherent subpopulations in data
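To make the notion of a cluster-specific feature importance score concrete, the sketch below uses a simple stand-in rather than the scoring used in the paper, which this summary does not specify. It assumes a feature matters for a cluster when the cluster's mean for that feature lies far from the overall mean, measured in overall standard deviations. The function name `cluster_feature_scores`, the toy data, and the averaging used for a global score are all hypothetical.

```python
import numpy as np


def cluster_feature_scores(X, labels):
    """Absolute standardized gap between each cluster's feature mean and the overall mean."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    overall_std = X.std(axis=0)
    overall_std[overall_std == 0] = 1.0  # constant features end up with score 0
    scores = {}
    for cluster in np.unique(labels):
        cluster_mean = X[labels == cluster].mean(axis=0)
        scores[int(cluster)] = np.abs(cluster_mean - overall_mean) / overall_std
    return scores


# Toy data: feature 0 separates the two clusters, feature 1 is constant.
X_toy = np.array([[0.0, 5.0], [0.0, 5.0], [10.0, 5.0], [10.0, 5.0]])
labels_toy = np.array([1, 1, 2, 2])

scores = cluster_feature_scores(X_toy, labels_toy)
# Assumed aggregation: average the per-cluster scores to get one global score per feature.
global_scores = np.mean(list(scores.values()), axis=0)
```

In this toy case the separating feature receives a high score in both clusters and the constant feature receives zero, which is the kind of per-cluster ranking that can then be read as a decision rule.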
Lisa Barros de Andrade e Sousa
Helmholtz AI, Helmholtz Munich, Ingolstädter Landstraße 1, Neuherberg, 85764, Bavaria, Germany.
Gregor Miller
Core Facility Statistical Consulting, Helmholtz Munich, Ingolstädter Landstraße 1, Neuherberg, 85764, Bavaria, Germany.
Ronan Le Gleut
Core Facility Statistical Consulting, Helmholtz Munich, Ingolstädter Landstraße 1, Neuherberg, 85764, Bavaria, Germany.
Dominik Thalmeier
Helmholtz AI, Helmholtz Munich, Ingolstädter Landstraße 1, Neuherberg, 85764, Bavaria, Germany.
Helena Pelin
Helmholtz AI, Helmholtz Munich, Ingolstädter Landstraße 1, Neuherberg, 85764, Bavaria, Germany.
Marie Piraud
Helmholtz AI, Helmholtz Zentrum München