🤖 AI Summary
This work investigates the quality of zero-shot self-explanations generated by instruction-tuned large language models (LLMs) for text classification, focusing on plausibility (agreement with human interpretation) and faithfulness (agreement with the model's actual decision process). The evaluation covers English, Danish, and Italian sentiment classification, as well as cross-lingual forced labour risk detection, systematically comparing LLM-generated explanations against human-annotated rationales and a post-hoc XAI baseline, layer-wise relevance propagation (LRP). Across these multilingual and multitask settings, zero-shot self-explanations align more closely with human annotations than LRP while maintaining a comparable level of faithfulness. Notably, this requires no additional training, fine-tuning, or architectural modification, which keeps deployment lightweight and broadly applicable. The findings suggest that self-explanations from instruction-tuned LLMs offer a scalable, practical route to model interpretation.
📝 Abstract
Instruction-tuned LLMs are able to explain their outputs to users by generating self-explanations. These require neither gradient computations nor the application of possibly complex XAI methods. In this paper, we analyse whether this ability results in good explanations. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans and their faithfulness to models. We study two text classification tasks: sentiment classification and forced labour detection, i.e., identifying pre-defined risk indicators of forced labour. In addition to English, we include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations for all samples. To allow for direct comparisons, we also compute post-hoc feature attributions via layer-wise relevance propagation (LRP) and analyse four LLMs. We show that self-explanations align more closely with human annotations than LRP, while maintaining a comparable level of faithfulness. This finding suggests that self-explanations indeed provide good explanations for text classification.
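The abstract compares input rationales against human annotations to measure plausibility. The paper's exact metric is not stated here, but a common way to quantify such agreement is token-level F1 between the set of tokens a model highlights and the set a human annotator marks. The following sketch illustrates that idea; the function name and the index-set representation of rationales are illustrative assumptions, not the authors' implementation.

```python
def rationale_f1(predicted: set[int], gold: set[int]) -> float:
    """Token-level F1 between a model rationale and a human rationale.

    predicted, gold: sets of token indices marked as part of the rationale.
    This is an illustrative plausibility proxy, not the paper's metric.
    """
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement by convention
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Model highlights tokens {1, 2, 5}; the human annotator marked {1, 2, 3}:
# precision = recall = 2/3, so F1 = 2/3.
score = rationale_f1({1, 2, 5}, {1, 2, 3})
```

Faithfulness, by contrast, is typically probed by perturbing or removing the highlighted tokens and checking how strongly the model's prediction changes, which is why it can be measured without human labels.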