Evaluating the Reliability of Self-Explanations in Large Language Models

📅 2024-07-19
🏛️ arXiv.org
🤖 AI Summary
This work addresses the authenticity and reliability of self-generated explanations by large language models (LLMs), revealing a pervasive disconnect between explanations and actual reasoning processes across both objective and subjective tasks.

Method: We introduce the first multidimensional reliability evaluation framework that explicitly distinguishes surface plausibility from genuine causal fidelity. Our approach validates explanations via reasoning-trajectory alignment, integrating human annotation, logical consistency checking, gradient-based attribution analysis, and counterfactual perturbation testing to establish a reproducible, quantitative protocol for explanation trustworthiness.

Contribution/Results: Empirical evaluation shows that over 60% of self-explanations from mainstream LLMs fail reasoning-path verification; moreover, improved model performance does not correlate with enhanced explanation reliability. This study establishes a novel paradigm and benchmark toolkit for rigorous LLM interpretability assessment.
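The counterfactual perturbation testing mentioned in the summary can be illustrated with a minimal toy sketch: an explanation that claims certain input tokens drove the output is treated as faithful only if masking those tokens changes the prediction more than masking randomly chosen tokens. All names below (the function, the `[MASK]` token, the toy classifier) are hypothetical illustrations, not the paper's actual implementation.

```python
import random

def counterfactual_fidelity(predict, tokens, claimed_important, n_trials=50, seed=0):
    """Toy faithfulness check: mask the tokens an explanation claims matter
    and compare the effect against masking random token sets of equal size."""
    rng = random.Random(seed)
    base = predict(tokens)

    def masked(idx_set):
        # Replace selected positions with a placeholder token.
        return ["[MASK]" if i in idx_set else t for i, t in enumerate(tokens)]

    # Does masking the claimed-important tokens flip the prediction?
    claimed_flips = predict(masked(set(claimed_important))) != base

    # How often does masking random tokens of the same size flip it?
    random_flips = 0
    for _ in range(n_trials):
        idx = set(rng.sample(range(len(tokens)), len(claimed_important)))
        if predict(masked(idx)) != base:
            random_flips += 1
    return claimed_flips, random_flips / n_trials

# Hypothetical toy classifier: "pos" iff the token "good" appears.
predict = lambda toks: "pos" if "good" in toks else "neg"
tokens = ["the", "movie", "was", "good", "overall"]
flipped, rand_rate = counterfactual_fidelity(predict, tokens, claimed_important=[3])
```

Here `flipped` is `True` (masking "good" flips the toy prediction), while `rand_rate` stays low because random masks rarely hit the decisive token; a faithful explanation should show exactly this asymmetry.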

Problem

Research questions and friction points this paper is trying to address.

Language Model Evaluation
Text Generation Interpretation
Objective and Subjective Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Causal Inference
Interpretability Enhancement