🤖 AI Summary
This work examines whether the self-generated explanations of large language models (LLMs) are authentic and reliable, revealing a pervasive disconnect between those explanations and the models' actual reasoning processes across both objective and subjective tasks.
Method: We introduce the first multidimensional reliability evaluation framework that explicitly distinguishes surface plausibility from genuine causal fidelity. Our approach validates each explanation against the model's reasoning trajectory, combining human annotation, logical consistency checking, gradient-based attribution analysis, and counterfactual perturbation testing into a reproducible, quantitative protocol for explanation trustworthiness.
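To make one of these checks concrete, here is a minimal sketch of a counterfactual perturbation test: the tokens an explanation cites as decisive are masked, and the explanation is flagged as unfaithful if the model's prediction survives unchanged. The `predict_fn`, the masking scheme, and the toy example are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of a counterfactual perturbation test for explanation
# faithfulness. Assumptions (not from the paper): predict_fn is any callable
# mapping a token list to a label, rationale_idx are the token positions the
# model's self-explanation claims were decisive, and masking uses a [MASK]
# placeholder.
from typing import Callable, List, Sequence


def counterfactual_faithfulness(
    predict_fn: Callable[[List[str]], str],
    tokens: List[str],
    rationale_idx: Sequence[int],
    mask_token: str = "[MASK]",
) -> bool:
    """Return True if masking the cited rationale tokens flips the prediction,
    i.e. the explanation passes this single counterfactual check."""
    original = predict_fn(tokens)
    cited = set(rationale_idx)
    perturbed = [mask_token if i in cited else tok for i, tok in enumerate(tokens)]
    return predict_fn(perturbed) != original


if __name__ == "__main__":
    # Toy stand-in model: predicts "positive" iff the word "great" appears.
    def toy_predict(toks: List[str]) -> str:
        return "positive" if "great" in toks else "negative"

    tokens = "the food was great but slow".split()

    # Explanation A cites "great" (position 3): masking it flips the label -> faithful.
    print(counterfactual_faithfulness(toy_predict, tokens, [3]))  # True
    # Explanation B cites "slow" (position 5): masking it changes nothing -> unfaithful.
    print(counterfactual_faithfulness(toy_predict, tokens, [5]))  # False
```

A full protocol would aggregate such checks over many inputs and combine them with the framework's other signals (attribution alignment, logical consistency, human judgment) rather than relying on a single masked prediction.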
Contribution/Results: In empirical evaluation, over 60% of self-explanations from mainstream LLMs fail reasoning-path verification, and improved task performance does not correlate with more reliable explanations. The study establishes a new paradigm and benchmark toolkit for rigorous assessment of LLM interpretability.