🤖 AI Summary
This study addresses a gap in the evaluation of Shapley value–based explanation methods: evaluation has relied on quantitative metrics whose alignment with human utility is unverified, and it has lacked empirical validation in high-stakes settings. The authors use a unified amortized computation framework to systematically compare eight Shapley variants under the low-latency constraints of a real-world risk control system. Across four risk datasets and a realistic fraud-detection environment in which professional analysts reviewed 3,735 cases, they conduct a large-scale assessment of these methods' impact on human decision-making. Their findings reveal that commonly used metrics, such as sparsity and faithfulness, fail to predict human-perceived clarity or decision utility. Although none of the methods improved objective analyst performance, explanations consistently increased decision confidence, exposing a risk of automation bias and underscoring the need for human-centered evaluation in explainable AI.
📝 Abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
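For readers unfamiliar with the quantities being compared, the following is a minimal sketch of the classic exact Shapley attribution together with one simple sparsity proxy. It is not the paper's amortized framework, nor any of its eight variants. The toy model `toy_model`, the all-zeros baseline, the baseline-replacement value function, and the sparsity definition (fraction of near-zero attributions) are all assumptions made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float.

    Enumerates every coalition, so the cost grows as 2**n_features;
    only suitable for tiny illustrative examples.
    """
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(n_features):
            # Classic Shapley weight |S|! (n - |S| - 1)! / n!
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def sparsity(phi, tol=1e-9):
    """Illustrative sparsity proxy: fraction of near-zero attributions."""
    return sum(abs(p) <= tol for p in phi) / len(phi)


# Hypothetical toy model with an interaction term; feature 2 is unused.
def toy_model(x):
    return 2.0 * x[0] + 1.0 * x[1] + x[0] * x[1]


x = [1.0, 1.0, 1.0]          # instance to explain (assumed)
baseline = [0.0, 0.0, 0.0]   # reference input (assumed)


def value_fn(coalition):
    # Features outside the coalition are replaced by their baseline value.
    masked = [x[j] if j in coalition else baseline[j] for j in range(len(x))]
    return toy_model(masked)


phi = shapley_values(value_fn, n_features=3)
print(phi)            # attributions per feature
print(sum(phi))       # efficiency: equals toy_model(x) - toy_model(baseline)
print(sparsity(phi))  # share of features with no attribution
```

In this toy case the interaction term is split evenly between features 0 and 1, and the unused feature receives zero attribution. A proxy such as `sparsity` is computed from the attribution vector alone, which is why whether it tracks human-perceived clarity is an empirical question of the kind the abstract raises.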