🤖 AI Summary
Existing explanation-evaluation methods struggle to distinguish the behavior of individual models within a Rashomon set of comparably performing models, and they are vulnerable to spurious explanations. To address this, this work proposes AXE, the first evaluation framework that does not rely on ground-truth explanations. Grounded in three principled criteria, AXE assesses the quality of feature-importance explanations without requiring access to ideal reference explanations or labeled data. It detects adversarial fairwashing attacks with a 100% success rate, determines whether models implicitly rely on protected attributes, and overcomes the limitations of conventional sensitivity-based and ground-truth-comparison approaches. By revealing behavioral differences among models in a Rashomon set, AXE facilitates the selection of more trustworthy models.
📝 Abstract
Explainable artificial intelligence (XAI) is concerned with producing explanations that indicate the inner workings of models. For a Rashomon set of similarly performing models, explanations provide a way of disambiguating the behavior of individual models, helping select models for deployment. However, explanations themselves can vary depending on the explainer used, and therefore need to be evaluated. In the paper "Evaluating Model Explanations without Ground Truth", we proposed three principles of explanation evaluation and a new method, AXE, for evaluating the quality of feature-importance explanations. We go on to illustrate how evaluation metrics that compare model explanations against ideal ground-truth explanations obscure behavioral differences within a Rashomon set. Explanation evaluation aligned with our proposed principles highlights these differences instead, helping select models from the Rashomon set. An adversary can select alternate models from the Rashomon set that make identical predictions yet mislead explainers into generating false explanations, and mislead evaluation methods into rating those false explanations as high quality. AXE, our proposed explanation evaluation method, detects this adversarial fairwashing of explanations with a 100% success rate. Unlike prior explanation evaluation strategies, such as those based on model sensitivity or ground-truth comparison, AXE can determine when protected attributes are used to make predictions.
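To make the general idea concrete, the sketch below shows one way a feature-importance explanation can be scored without ground-truth attributions: check whether the features the explanation ranks highest are, on their own, enough for a simple reference model to reproduce the explained model's predictions. This is an illustrative toy in the spirit of the principles above, not the paper's exact AXE procedure; the dataset, the kNN reference model, and the `explanation_score` helper are all assumptions introduced for this example.

```python
# Hypothetical sketch (not the exact AXE procedure from the paper):
# score a feature-importance explanation, with no ground-truth
# attributions, by checking whether the top-ranked features alone let
# a simple reference model (a kNN) recover the model's own predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: only 3 of 10 features carry signal.
X, y = make_classification(n_samples=400, n_features=10,
                           n_informative=3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
preds = model.predict(X)  # only the model's outputs are used, no labels

def explanation_score(importance, k=3):
    """Fit a kNN on the k most-important features and measure how well
    it reproduces the model's predictions (higher = more faithful)."""
    top = np.argsort(-np.abs(importance))[:k]
    knn = KNeighborsClassifier(n_neighbors=5).fit(X[:, top], preds)
    return knn.score(X[:, top], preds)

faithful = np.abs(model.coef_[0])  # coefficient magnitudes as importance
spurious = np.random.default_rng(1).permutation(faithful)  # shuffled ranking
print(explanation_score(faithful), explanation_score(spurious))
```

A faithful importance ranking concentrates on the informative features, so the reference model recovers the predictions well; a spurious ranking tends to select uninformative features and scores lower. Nothing in the procedure consults ground-truth explanations or even the true labels.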