AI Summary
This paper addresses the lack of rigor in evaluating the reliability of neuron-level textual explanations for neural networks. We propose the first unified mathematical framework and two verifiable sanity checks for explanation evaluation. Our approach combines formal modeling, statistical consistency analysis, and robustness testing against concept-label perturbations. Empirical results reveal that mainstream evaluation metrics consistently fail under label perturbations, exhibiting critically low sensitivity. We introduce a novel set of high-sensitivity, trustworthy evaluation metrics, proven to be theoretically comparable and statistically analyzable. Our key contributions are: (1) exposing systematic fragility in existing metrics; (2) establishing the first explanation evaluation framework grounded in both formal semantics and empirical validation; and (3) proposing seven practice-oriented principles for assessing explanation quality, thereby advancing mechanistic interpretability toward greater rigor, reproducibility, and scientific accountability.
Abstract
Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a short textual explanation of the behavior of each neuron or unit. For these explanations to be useful, we must understand how reliable and truthful they are. In this work, we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare existing evaluation metrics, understand the evaluation pipeline with greater clarity, and apply established statistical methods to the evaluation. In addition, we propose two simple sanity checks for evaluation metrics and show that many commonly used metrics fail these tests: their scores do not change even after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
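The concept-label sanity check described above can be illustrated with a minimal sketch. Everything here is an illustrative assumption rather than the paper's actual setup: `correlation_metric` is a toy difference-of-means metric, the activations are synthetic, and the perturbation is a full label flip. The point is only the shape of the test: a trustworthy metric's score should move substantially when the concept labels are massively perturbed.

```python
def correlation_metric(activations, labels):
    # Toy evaluation metric (an assumption, not one of the paper's metrics):
    # mean activation on concept-positive inputs minus mean activation
    # on concept-negative inputs.
    pos = [a for a, l in zip(activations, labels) if l == 1]
    neg = [a for a, l in zip(activations, labels) if l == 0]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def label_flip_check(metric, activations, labels):
    """Sanity check: flip every concept label (a maximal perturbation)
    and report how much the metric's score moves. A score near zero
    would indicate the metric is insensitive to the labels."""
    original = metric(activations, labels)
    flipped = [1 - l for l in labels]
    perturbed = metric(activations, flipped)
    return abs(original - perturbed)


# Synthetic neuron that fires strongly on concept-positive inputs.
activations = [0.9, 0.8, 0.95, 0.1, 0.2, 0.05]
labels = [1, 1, 1, 0, 0, 0]

sensitivity = label_flip_check(correlation_metric, activations, labels)
```

A metric that fails this check would return a `sensitivity` near zero even though the labels were completely inverted; the toy metric above passes because flipping the labels negates its score.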