🤖 AI Summary
This paper investigates whether the “sufficiency” metric for natural language rationales accurately reflects their causal contribution to model decisions. Method: The authors conduct systematic cross-domain experiments, combining token-level rationale classification with rationale-guided attention regularization to analyze how rationale information functions across diverse tasks and model architectures. Contribution/Results: They find that sufficiency correlates directly with neither a model’s ability to classify rationale tokens nor the effectiveness of attention regularization guided by rationales, revealing fundamental validity limitations in this widely adopted metric. Key results: (i) highly informative rationales do not necessarily improve classification accuracy; (ii) incorporating rationales can enhance cross-domain generalization, but gains are highly contingent on task and model architecture. The work challenges the implicit assumption that sufficiency is a reliable proxy for explanation faithfulness, advocating instead for more rigorous, mechanism-aware evaluation frameworks for interpretable AI.
📝 Abstract
Human explanations in natural language, known as rationales, are a tool for assessing whether models learn a label for the right reasons or rely on dataset-specific shortcuts. Sufficiency is a common metric for estimating the informativeness of rationales, but it provides limited insight into the effects of rationale information on model performance. We address this limitation by relating sufficiency to two modelling paradigms: the ability of models to identify which tokens are part of the rationale (through token classification) and the ability to improve model performance by incorporating rationales in the input (through attention regularisation). We find that highly informative rationales are unlikely to help classify the instance correctly. Conversely, sufficiency captures the classification impact of the non-rationalised context, which interferes with rationale information in the same input. We also find that incorporating rationale information in model inputs can boost cross-domain classification, but results are inconsistent across tasks and model types. Finally, sufficiency and token classification appear to be unrelated. These results exemplify the complexity of rationales, showing that metrics capable of systematically capturing this type of information merit further investigation.
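To make the central metric concrete, here is a minimal sketch of sufficiency as it is commonly defined in the rationale literature: the drop in the model's predicted probability for its original label when the input is reduced to the rationale tokens only. The names `predict_proba`, `toy_model`, and the example data are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of the sufficiency metric (as commonly defined for rationales):
# sufficiency = p(y | full input) - p(y | rationale tokens only).
# `predict_proba` is a hypothetical stand-in for any classifier mapping a
# token list to a probability distribution over labels.

def sufficiency(predict_proba, tokens, rationale_mask):
    """Lower values mean the rationale alone nearly recovers the
    original prediction, i.e. the rationale is more 'sufficient'."""
    full_probs = predict_proba(tokens)
    # Label the model predicts on the full input.
    label = max(range(len(full_probs)), key=full_probs.__getitem__)
    # Keep only the tokens marked as rationale.
    rationale_only = [t for t, keep in zip(tokens, rationale_mask) if keep]
    reduced_probs = predict_proba(rationale_only)
    return full_probs[label] - reduced_probs[label]


# Toy classifier: positive-class probability grows with occurrences of
# the word "good" (purely illustrative, not a trained model).
def toy_model(tokens):
    pos = min(0.9, 0.1 + 0.4 * tokens.count("good"))
    return [1.0 - pos, pos]

tokens = ["the", "film", "was", "good", "good"]
mask = [0, 0, 0, 1, 1]  # rationale covers both "good" tokens
print(round(sufficiency(toy_model, tokens, mask), 3))
```

Here the rationale carries all the evidence the toy model uses, so the reduced input reproduces the full-input prediction and sufficiency is near zero; a rationale that omitted evidence would yield a larger drop.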