🤖 AI Summary
This paper identifies pervasive class-dependent evaluation effects in time-series feature attribution assessment, challenging the reliability of perturbation-based metrics in explainable AI (XAI). Method: Using controllable synthetic time-series data with ground-truth attributions, we systematically analyze how feature types and inter-class differences affect evaluation outcomes. Contribution/Results: We find, first, that even in simple temporal settings, perturbation-based metrics (e.g., deletion/insertion curves) exhibit strong class dependence and frequently contradict ground-truth metrics (e.g., precision-recall). Second, minor inter-class differences in feature amplitude or duration significantly bias evaluation results. Third, the average correlation between perturbation-based and ground-truth-based metrics is weak (r < 0.3), and over 50% of class pairs yield completely reversed rankings. These findings undermine core assumptions of consistency and trustworthiness in current XAI evaluation paradigms and provide critical empirical evidence for redesigning attribution methods and establishing more robust, class-agnostic evaluation standards.
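To make the contrast between the two metric families concrete, here is a minimal Python sketch, not the paper's code: the stand-in classifier `model_prob`, the window location, and the toy attribution are all hypothetical. It computes a perturbation-based deletion score and a ground-truth precision@k for one attribution map on a synthetic time series with a known discriminative window.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100                              # time steps
feature = slice(40, 60)              # ground-truth discriminative window

def model_prob(x):
    # Hypothetical stand-in classifier: responds to the mean
    # amplitude inside the discriminative window.
    return 1.0 / (1.0 + np.exp(-x[feature].mean()))

x = rng.normal(0.0, 0.1, T)
x[feature] += 1.0                    # class-defining bump
attribution = np.abs(x)              # toy attribution map (e.g., |saliency|)

def deletion_score(x, attr, k=20, baseline=0.0):
    # Perturbation-based view: delete the k most-attributed steps and
    # measure how much the class probability degrades (higher = "better").
    order = np.argsort(attr)[::-1][:k]
    x_pert = x.copy()
    x_pert[order] = baseline
    return model_prob(x) - model_prob(x_pert)

def precision_at_k(attr, gt_mask, k=20):
    # Ground-truth view: fraction of the top-k attributed steps that
    # fall inside the known feature locations (synthetic settings only).
    top = np.argsort(attr)[::-1][:k]
    return gt_mask[top].mean()

gt_mask = np.zeros(T, dtype=bool)
gt_mask[feature] = True
print(deletion_score(x, attribution))        # perturbation-based score
print(precision_at_k(attribution, gt_mask))  # ground-truth-based score
```

The key point the paper's results turn on is that these two scores need not agree: the deletion score rewards whatever the model happens to react to, while precision@k rewards overlap with the known discriminative locations.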
📄 Abstract
Evaluating feature attribution methods represents a critical challenge in explainable AI (XAI), as researchers typically rely on perturbation-based metrics when ground truth is unavailable. However, recent work demonstrates that these evaluation metrics can show different performance across predicted classes within the same dataset. These "class-dependent evaluation effects" raise questions about whether perturbation analysis reliably measures attribution quality, with direct implications for XAI method development and the trustworthiness of evaluation techniques. We investigate under which conditions these class-dependent effects arise by conducting controlled experiments with synthetic time-series data where ground-truth feature locations are known. We systematically vary feature types and class contrasts across binary classification tasks, then compare perturbation-based degradation scores with ground-truth-based precision-recall metrics using multiple attribution methods. Our experiments demonstrate that class-dependent effects emerge under both evaluation approaches even in simple scenarios with temporally localized features, triggered by basic variations in feature amplitude or temporal extent between classes. Most critically, we find that perturbation-based and ground-truth metrics frequently yield contradictory assessments of attribution quality across classes, with weak correlations between the two evaluation approaches. These results suggest that researchers should interpret perturbation-based metrics with care, as they may not always align with whether attributions correctly identify discriminating features. They also reveal opportunities to reconsider what attribution evaluation actually measures and to develop more comprehensive evaluation frameworks that capture multiple dimensions of attribution quality.
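As an illustration of the controlled setup described above, the following Python snippet is a sketch under assumed parameters, not the authors' actual data generator. It builds binary-class samples whose temporally localized feature differs only in amplitude between classes, with feature duration available as the alternative contrast.

```python
import numpy as np

def make_sample(label, T=100, amp=(1.0, 2.0), width=(20, 20), seed=None):
    """Generate one synthetic series plus its ground-truth feature mask.

    The two classes share background noise and differ only in the
    class-indexed amplitude (or, if `width` is varied, duration) of a
    single localized feature. All parameter values here are assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 0.1, T)                   # background noise
    start = rng.integers(0, T - width[label])     # random feature onset
    x[start:start + width[label]] += amp[label]   # class-specific feature
    mask = np.zeros(T, dtype=bool)
    mask[start:start + width[label]] = True       # ground-truth locations
    return x, mask

# Classes 0 and 1 differ only in feature amplitude here; passing
# width=(10, 40) instead would vary temporal extent between classes.
x0, m0 = make_sample(0, seed=1)
x1, m1 = make_sample(1, seed=2)
```

Because the mask records exactly where the discriminative feature lies, such a generator supports both ground-truth precision-recall evaluation and perturbation-based evaluation on the same samples, which is what makes the class-wise comparison between the two approaches possible.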