Feature salience -- not task-informativeness -- drives machine learning model explanations

📅 2026-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether current explainable artificial intelligence (XAI) methods genuinely reflect task-relevant information learned by models or are instead dominated by task-irrelevant factors such as input feature salience. By introducing controllable class-dependent and class-independent watermarks into binary image classification tasks, the authors systematically evaluate five prominent attribution methods, including Grad-CAM and Integrated Gradients. The experiments reveal that all methods substantially overestimate the importance of watermark regions (R² ≥ 0.45 for relative importance in watermarked areas, RIW), while the association between watermarks and class labels has negligible influence (R² ≤ 0.03). This provides empirical evidence that contemporary XAI attributions are primarily driven by feature salience rather than task-relevant information, thereby challenging a foundational assumption underlying the validity of XAI techniques.

📝 Abstract
Explainable AI (XAI) promises to provide insight into machine learning models' decision processes, where one goal is to identify failures such as shortcut learning. This promise relies on the field's assumption that input features marked as important by an XAI method must contain information about the target variable. However, it is unclear whether informativeness is indeed the main driver of importance attribution in practice, or whether other data properties such as statistical suppression, novelty at test time, or high feature salience substantially contribute. To clarify this, we trained deep learning models on three variants of a binary image classification task, in which translucent watermarks are either absent, act as class-dependent confounds, or represent class-independent noise. Results for five popular attribution methods show substantially elevated relative importance in watermarked areas (RIW) for all models regardless of the training setting ($R^2 \geq .45$). By contrast, whether the presence of watermarks is class-dependent or not has only a marginal effect on RIW ($R^2 \leq .03$), despite a clear impact on model performance and generalisation ability. XAI methods show similar behaviour to model-agnostic edge detection filters and attribute substantially less importance to watermarks when bright image intensities are encoded by smaller instead of larger feature values. These results indicate that importance attribution is most strongly driven by the salience of image structures at test time rather than by statistical associations learned by machine learning models. Previous studies demonstrating successful XAI application should be reevaluated with respect to a possibly spurious co-occurrence of feature salience and informativeness, and workflows using feature attribution methods as building blocks should be scrutinised.
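The abstract's central quantity is the relative importance attributed to watermarked areas (RIW). The paper does not spell out its formula here, but one plausible formalisation, sketched below under that assumption, is the share of attribution mass inside the watermark mask normalised by the mask's share of the image area, so that a value of 1.0 means the watermark receives exactly chance-level importance and values above 1.0 indicate over-attribution:

```python
import numpy as np

def relative_importance_in_watermark(attribution, mask):
    """Share of attribution mass inside the watermark region, normalised
    by the region's share of the image area.

    NOTE: this is a hypothetical formalisation of the paper's RIW metric,
    not necessarily the authors' exact definition.
    """
    attribution = np.abs(attribution)        # use attribution magnitude
    mask = mask.astype(bool)
    mass_fraction = attribution[mask].sum() / attribution.sum()
    area_fraction = mask.mean()              # fraction of pixels in the mask
    return mass_fraction / area_fraction

# Toy example: a 10x10 attribution map with extra mass on a 2x2 "watermark"
rng = np.random.default_rng(0)
attr = rng.random((10, 10))
attr[:2, :2] += 5.0                          # salient watermark corner
wm = np.zeros((10, 10), dtype=bool)
wm[:2, :2] = True
print(relative_importance_in_watermark(attr, wm) > 1.0)  # over-attribution
```

Under this definition, comparing RIW across the three training settings (no watermark, class-dependent watermark, class-independent watermark) isolates whether attributions track the learned label association or merely the watermark's visual salience.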
Problem

Research questions and friction points this paper is trying to address.

feature salience
explainable AI
attribution methods
shortcut learning
feature importance
Innovation

Methods, ideas, or system contributions that make the work stand out.

feature salience
explainable AI
attribution methods
shortcut learning
model interpretability