Explanation Regularisation through the Lens of Attributions

📅 2024-07-23
🏛️ International Conference on Computational Linguistics
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates whether Explanation Regularization (ER) genuinely enhances text classifiers’ reliance on human-justified features and clarifies the causal relationship between such reliance and out-of-distribution (OOD) generalization. Addressing the limitation of prior studies—namely, their exclusive reliance on a single attribution method while ignoring methodological heterogeneity—we systematically employ multiple attribution techniques (e.g., Integrated Gradients, LIME), augmented by human-annotated rationale supervision, cross-method attribution consistency analysis, and evaluation across diverse OOD benchmarks. Our results reveal only a weak correlation between ER and actual model reliance on semantically reasonable tokens; critically, ER’s OOD performance gains are not causally attributable to strengthened reliance on human-justified features, but rather stem from unidentified auxiliary mechanisms. To our knowledge, this is the first study to challenge the dominant causal interpretation of ER’s generalization benefit—via rigorous multi-method attribution validation—thereby prompting fundamental reexamination of theoretical foundations and empirical assessment practices in explainable AI.
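The cross-method consistency analysis and "reliance on plausible tokens" measurements described above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: the function names, the choice of Spearman rank correlation for cross-method agreement, and the attribution-mass metric for plausibility are all assumptions made for the example.

```python
import numpy as np

def rank(x):
    """Assign ranks to scores (simple version, no tie handling)."""
    r = np.empty(len(x))
    r[np.argsort(x)] = np.arange(len(x))
    return r

def attribution_agreement(scores_a, scores_b):
    """Spearman rank correlation between two attribution score vectors
    for the same input, e.g. Integrated Gradients vs. LIME
    (an illustrative consistency measure, not the paper's exact one)."""
    ra, rb = rank(np.asarray(scores_a)), rank(np.asarray(scores_b))
    return float(np.corrcoef(ra, rb)[0, 1])

def rationale_mass(scores, rationale_mask):
    """Share of absolute attribution mass falling on human-annotated
    rationale tokens -- one way to quantify reliance on plausible tokens."""
    a = np.abs(np.asarray(scores, dtype=float))
    mask = np.asarray(rationale_mask).astype(bool)
    return float(a[mask].sum() / (a.sum() + 1e-8))
```

With these two measurements, one can ask exactly the paper's questions: whether guiding the model with one attribution method also raises rationale mass under a *different* method, and whether higher rationale mass tracks OOD gains.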

📝 Abstract
Explanation regularisation (ER) has been introduced as a way to guide text classifiers to form their predictions relying on input tokens that humans consider plausible. This is achieved by introducing an auxiliary explanation loss that measures how well the output of an input attribution technique for the model agrees with human-annotated rationales. The guidance appears to benefit performance in out-of-domain (OOD) settings, presumably due to an increased reliance on "plausible" tokens. However, previous work has under-explored the impact of guidance on that reliance, particularly when reliance is measured using attribution techniques different from those used to guide the model. In this work, we seek to close this gap, and also explore the relationship between reliance on plausible features and OOD performance. We find that the connection between ER and the ability of a classifier to rely on plausible features has been overstated and that a stronger reliance on plausible tokens does not seem to be the cause for OOD improvements.
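The auxiliary explanation loss in the abstract can be sketched as below. The attribution scores would in practice come from a technique such as Integrated Gradients; the MSE alignment term, the normalisation, and the `lambda_er` weight are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def explanation_loss(attributions, rationale_mask):
    """Penalise disagreement between model attributions and human rationales.

    attributions:   per-token importance scores from an attribution method
                    (e.g. Integrated Gradients), one value per input token.
    rationale_mask: binary human annotations (1 = token marked plausible).
    Mean-squared error between the two normalised distributions is one
    simple choice of agreement measure (an assumption for this sketch).
    """
    a = np.abs(np.asarray(attributions, dtype=float))
    a = a / (a.sum() + 1e-8)                       # normalise attributions
    m = np.asarray(rationale_mask, dtype=float)
    m = m / (m.sum() + 1e-8)                       # normalise rationale mask
    return float(np.mean((a - m) ** 2))

def er_objective(task_loss, attributions, rationale_mask, lambda_er=1.0):
    """Total ER training objective: task loss plus weighted explanation loss."""
    return task_loss + lambda_er * explanation_loss(attributions, rationale_mask)
```

Here `lambda_er` trades off task accuracy against agreement with human rationales; the paper's finding is that minimising such a loss need not translate into genuine reliance on the annotated tokens.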
Problem

Research questions and friction points this paper is trying to address.

Explores impact of explanation regularisation on model reliance.
Examines relationship between plausible features and OOD performance.
Challenges overstated connection between ER and plausible token reliance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-method attribution evaluation of ER (e.g., Integrated Gradients, LIME)
Cross-method attribution consistency analysis
Out-of-domain performance analysis decoupled from plausible-token reliance