🤖 AI Summary
This study addresses the instability of feature importance estimation in highly expressive models—such as deep neural networks—which can compromise the reliability of scientific discovery in critical domains like biomedicine. Through both theoretical and empirical analysis, the authors systematically compare two strategies: model-level ensembling and explanation-level aggregation. By analyzing excess risk, they establish, for the first time, a theoretical advantage of model-level ensembling under nonlinear importance measures. Comprehensive evaluations on classical benchmarks and large-scale UK Biobank proteomic data demonstrate that model-level ensembling substantially improves the stability and accuracy of feature importance estimates, particularly for high-capacity models. These findings underscore its practical value for robust interpretability in complex predictive settings.
📝 Abstract
Feature-importance methods show promise in transforming machine learning models from predictive engines into tools for scientific discovery. However, due to data sampling and algorithmic stochasticity, expressive models can be unstable, leading to inaccurate variable-importance estimates and undermining the utility of these estimates in critical biomedical applications. Although ensembling offers a solution, deciding whether to explain a single ensemble model or to aggregate individual model explanations is difficult due to the nonlinearity of importance measures and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is primarily driven by the model's excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.
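To make the contrast between the two strategies concrete, here is a minimal sketch (not the paper's exact procedure) that uses permutation importance as the nonlinear importance measure and bagged MLP regressors as the expressive base models; the synthetic data, model choices, and the `permutation_importance` helper are illustrative assumptions, not taken from the study.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p, B = 500, 10, 20
X = rng.normal(size=(n, p))
# Only features 0 and 1 carry signal in this toy setup.
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

def permutation_importance(predict, X, y, rng, n_repeats=5):
    """Drop in R^2 when each column is permuted; larger = more important."""
    base = r2_score(y, predict(X))
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            imp[j] += (base - r2_score(y, predict(Xp))) / n_repeats
    return imp

# Train B models on bootstrap resamples: the source of sampling
# and algorithmic (random-initialization) variability.
models = []
for b in range(B):
    idx = rng.integers(0, n, n)
    m = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=b)
    models.append(m.fit(X[idx], y[idx]))

# Strategy 1: explanation-level aggregation -- average per-model importances.
expl_level = np.mean(
    [permutation_importance(m.predict, X, y, rng) for m in models], axis=0
)

# Strategy 2: model-level ensembling -- explain the averaged predictor once.
ensemble_predict = lambda Z: np.mean([m.predict(Z) for m in models], axis=0)
model_level = permutation_importance(ensemble_predict, X, y, rng)

print("explanation-level:", np.round(expl_level, 3))
print("model-level:      ", np.round(model_level, 3))
```

Because the importance measure is nonlinear in the predictor, the two strategies generally do not coincide; the paper's argument is that explaining the ensemble predictor (Strategy 2) benefits from its lower excess risk, which is the leading error term for expressive models.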