🤖 AI Summary
This study addresses a critical gap: the explanations generated by state-of-the-art explainable AI methods for legal language models often diverge sharply from the judgments of legal experts, undermining transparency and trustworthiness in legal applications. The authors propose a model-agnostic interpretability evaluation framework that extracts concise, human-understandable rationales from input texts and, for the first time, incorporates systematic manual assessment by legal experts of rationales produced by models classifying cases from the European Court of Human Rights (ECtHR). The framework integrates faithfulness metrics, such as normalized sufficiency and comprehensiveness, with expert-based reasonableness scores, and also explores LLM-as-a-Judge as an automated proxy for expert evaluation. The results reveal that despite strong classification performance and promising quantitative benchmarks, current interpretability methods yield rationales fundamentally misaligned with expert reasoning, highlighting their inadequacy for real-world legal practice.
📝 Abstract
Interpretability is critical for applications of large language models in the legal domain, which requires trust and transparency. While some studies develop task-specific approaches, others use the classification model's parameters to explain its decisions. However, which technique best explains legal outcome predictions remains an open question. To address this challenge, we propose a comparative analysis framework for model-agnostic interpretability techniques. Within this framework, we employ two rationale extraction methods, which justify outcomes with human-interpretable and concise text fragments (i.e., rationales) drawn from the input text. We conduct the comparison by evaluating faithfulness, via normalized sufficiency and comprehensiveness metrics, alongside plausibility, by asking legal experts to evaluate the extracted rationales. We further assess the feasibility of LLM-as-a-Judge using the legal expert evaluation results. We show that the model's "reasons" for predicting a violation differ substantially from those of legal experts, despite highly promising quantitative results and reasonable downstream classification performance. The source code of our experiments is publicly available at https://github.com/trusthlt/IntEval.
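To make the faithfulness metrics concrete, here is a minimal sketch of ERASER-style sufficiency and comprehensiveness with a null-baseline normalization in the spirit of Carton et al. (2020). The probabilities below are toy values, not the paper's actual classifier outputs, and the exact normalization used in the paper may differ.

```python
def sufficiency(p_full: float, p_rationale: float) -> float:
    """How well the rationale alone supports the prediction.

    1.0 means predicting from the rationale is at least as
    confident as predicting from the full input.
    """
    return 1.0 - max(0.0, p_full - p_rationale)


def comprehensiveness(p_full: float, p_without_rationale: float) -> float:
    """How much the prediction degrades once the rationale is removed.

    Values near 1.0 mean the prediction collapses without the rationale.
    """
    return max(0.0, p_full - p_without_rationale)


def normalize(score: float, null_score: float) -> float:
    """Rescale a score against a null (empty-rationale) baseline,
    so 0 = no better than the null baseline, 1 = perfect."""
    return (score - null_score) / (1.0 - null_score)


# Toy probabilities for the predicted "violation" class (illustrative only):
p_full = 0.90        # p(y | full case text)
p_rationale = 0.85   # p(y | extracted rationale only)
p_complement = 0.40  # p(y | case text with rationale removed)
p_null = 0.50        # p(y | empty input), the null baseline

suff = sufficiency(p_full, p_rationale)       # 1 - 0.05 = 0.95
null_suff = sufficiency(p_full, p_null)       # 1 - 0.40 = 0.60
norm_suff = normalize(suff, null_suff)        # 0.35 / 0.40 = 0.875

comp = comprehensiveness(p_full, p_complement)  # 0.90 - 0.40 = 0.50
```

A high normalized sufficiency combined with a high comprehensiveness indicates the rationale is faithful to the model; plausibility, by contrast, cannot be computed this way and is what the expert evaluation supplies.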