🤖 AI Summary
Existing methods for identifying the concepts encoded by individual neurons in deep neural networks lack theoretical foundations, particularly regarding faithfulness (whether identified concepts reflect neurons' true functional roles) and stability (whether results are consistent across probing datasets).
Method: This paper establishes the first formal interpretability framework for concept identification, modeling it as "inverse machine learning": rather than learning a function from labeled data, one infers which concept a trained neuron computes. Within this framework, the paper derives a fidelity generalization bound, designs a quantitative stability metric, and proposes Bootstrap Explanation (BE), a method that produces concept prediction sets with statistical coverage guarantees (≥95%).
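The paper's exact bound is not reproduced in this summary, but fidelity generalization bounds of this kind typically take a concentration form. As an illustrative example (an assumption about the general shape, not the paper's statement): for a similarity metric bounded in [0, 1] that averages per-sample terms, such as accuracy, Hoeffding's inequality gives, with probability at least $1-\delta$ over an i.i.d. probing set of size $n$,

$$
\left|\widehat{\mathrm{sim}}_n(a, c) - \mathrm{sim}(a, c)\right| \;\le\; \sqrt{\frac{\log(2/\delta)}{2n}},
$$

where $\widehat{\mathrm{sim}}_n$ is the similarity between neuron $a$ and concept $c$ measured on the probing set and $\mathrm{sim}$ is its population counterpart; metrics like AUROC and IoU are not simple sample means and would require more careful (e.g., U-statistic) arguments.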
Contribution/Results: Through theoretical analysis, extensions to CLIP-Dissect, and experiments on both synthetic and real-world vision models, the framework significantly improves the robustness and reproducibility of concept identification. It is the first approach to combine theoretical rigor (provable bounds and statistical coverage guarantees) with practical effectiveness across diverse architectures and domains, enabling trustworthy mechanistic analysis of deep networks.
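To make the BE idea concrete, below is a minimal sketch of how a bootstrap-based concept prediction set for a single neuron could be built. Everything here (the function names, the voting scheme, the cumulative-frequency stopping rule) is an illustrative assumption, not the paper's exact algorithm:

```python
import numpy as np

def bootstrap_explanation(score_concepts, n_samples, concepts,
                          n_boot=1000, alpha=0.05, seed=0):
    """Sketch of a bootstrap concept prediction set for one neuron.

    score_concepts(idx) should return a (len(concepts),) array of
    similarity scores (e.g. AUROC between the neuron's activations and
    each concept's labels) computed on the resampled probing indices idx.
    """
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(concepts))
    for _ in range(n_boot):
        idx = rng.integers(0, n_samples, size=n_samples)  # resample the probing set
        votes[int(np.argmax(score_concepts(idx)))] += 1   # best-matching concept wins
    freq = votes / n_boot
    # Add concepts in decreasing order of selection frequency until the
    # set's cumulative frequency reaches 1 - alpha, aiming for the set to
    # cover the neuron's true concept with probability >= 95%.
    pred_set, cum = [], 0.0
    for j in np.argsort(-freq):
        pred_set.append(concepts[j])
        cum += freq[j]
        if cum >= 1 - alpha:
            break
    return pred_set, dict(zip(concepts, freq.tolist()))
```

With alpha = 0.05, a neuron with one dominant concept yields a singleton set, while an unstable neuron yields a larger set; this set-size-versus-coverage trade-off is exactly what a coverage guarantee formalizes.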
📝 Abstract
Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, a rigorous theoretical foundation remains absent, which is crucial for trustworthy and reliable explanations. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron's underlying function, and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used similarity metrics (e.g., accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability, along with the Bootstrap Explanation (BE) method, which generates concept prediction sets with guaranteed coverage probability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing an important step toward trustworthy neuron identification.
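As a rough illustration of the similarity metrics named above, the following sketch scores one neuron against one binary concept on a probing set. The base-rate thresholding rule used for binarization is an assumption made for this example, not necessarily the papers' choice:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def concept_similarity(activations, concept_mask, threshold=None):
    """Score how well a neuron's activations match a binary concept.

    activations:  (n,) float neuron activations on the probing dataset.
    concept_mask: (n,) bool, True where the concept is present.
    threshold:    activation cutoff for binarizing the neuron; by default,
                  the quantile matching the concept's base rate (an
                  illustrative choice).
    """
    activations = np.asarray(activations, dtype=float)
    concept_mask = np.asarray(concept_mask, dtype=bool)
    if threshold is None:
        threshold = np.quantile(activations, 1.0 - concept_mask.mean())
    fired = activations > threshold
    accuracy = float(np.mean(fired == concept_mask))
    auroc = roc_auc_score(concept_mask, activations)  # threshold-free ranking metric
    union = np.logical_or(fired, concept_mask).sum()
    iou = float(np.logical_and(fired, concept_mask).sum() / union) if union else 0.0
    return {"accuracy": accuracy, "auroc": auroc, "iou": iou}
```

A neuron is then typically assigned the concept maximizing such a score over a concept vocabulary; under the BE procedure, it would instead receive a prediction set of high-scoring concepts.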