🤖 AI Summary
This study addresses the core challenge in eXplainable Machine Learning (XML) evaluation: selecting task-appropriate explanation methods. We propose a dual-track evaluation framework integrating LLM-as-a-Judge with human assessment, systematically comparing their performance on the Iris classification task across subjective (interpretability, plausibility) and objective (fidelity, consistency) metrics. Our work provides the first empirical characterization of LLMs’ effectiveness boundaries in XML evaluation: LLMs achieve high agreement with human judgments on subjective quality (Spearman ρ > 0.8), yet exhibit significant divergence on domain-knowledge-dependent objective metrics. Results indicate that while LLMs serve as efficient, scalable auxiliary evaluators, they cannot yet supplant human judgment for critical objective dimensions. This yields methodological insights for trustworthy XML evaluation and establishes empirically grounded practical boundaries for LLM-assisted assessment.
📝 Abstract
eXplainable machine learning (XML) has recently emerged to demystify the opaque mechanisms of machine learning (ML) systems by interpreting their 'black box' results. Despite the development of various explanation methods, determining the most suitable XML method for a specific ML context remains an open problem, highlighting the need for effective evaluation of explanations. The evaluation capabilities of Transformer-based large language models (LLMs) present an opportunity to adopt LLM-as-a-Judge for assessing explanations. In this paper, we propose a workflow that integrates both LLM-based and human judges for evaluating explanations. We examine how LLM-based judges assess the quality of various explanation methods and compare their evaluation capabilities to those of human judges within an Iris classification scenario, employing both subjective and objective metrics. We conclude that while LLM-based judges effectively assess the quality of explanations using subjective metrics, they are not yet sufficiently developed to replace human judges in this role.
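The LLM-human agreement on subjective metrics reported above is measured with Spearman's rank correlation. As a minimal sketch of how such agreement could be computed, the pure-Python snippet below implements the rank-difference formula for Spearman's ρ; the ratings are hypothetical placeholders, not the study's actual data.

```python
# Sketch: measuring LLM-human agreement on subjective explanation
# quality with Spearman's rank correlation. All scores are hypothetical.

def spearman_rho(xs, ys):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank  # rank 1 = smallest value
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical interpretability ratings for six explanations (1-10 scale)
human_scores = [8, 3, 6, 9, 2, 7]
llm_scores = [7, 4, 6, 9, 1, 8]

rho = spearman_rho(human_scores, llm_scores)
print(f"Spearman rho = {rho:.3f}")  # -> Spearman rho = 0.943
```

In practice one would use `scipy.stats.spearmanr`, which also handles ties; values of ρ above 0.8, as reported for the subjective metrics, indicate strong rank agreement between the two judge tracks.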