🤖 AI Summary
Mechanistic interpretability (MI) aims to explain neural networks causally, but its progress has been limited by the lack of a universal approach to evaluating explanations. This paper addresses that gap by asking "What makes a good explanation?" and answering with a pluralist Explanatory Virtues Framework that draws on four perspectives from the philosophy of science: the Bayesian, Kuhnian, Deutschian, and Nomological accounts. Assessed against these virtues, Compact Proofs emerge as a promising approach because they exhibit many virtues at once, including concision, unification, and generality. The framework also suggests three fruitful research directions: clearly defining explanatory simplicity, focusing on unifying explanations, and deriving universal principles for neural networks. Better evaluation of mechanistic explanations, in turn, strengthens our ability to monitor, predict, and steer AI systems.
📝 Abstract
Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework, drawing on four perspectives from the Philosophy of Science (the Bayesian, Kuhnian, Deutschian, and Nomological), to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations, and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.
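To make the Compact Proofs idea concrete, here is a minimal, self-contained sketch of the general recipe: use a mechanistic interpretation of a model to derive a verified performance guarantee whose cost is far below exhaustively checking inputs. The toy model, task, and margin bound below are hypothetical illustrations chosen for brevity, not the paper's construction.

```python
# Illustrative sketch of a "compact proof" style argument (toy example,
# not the paper's actual procedure; all names here are hypothetical).
import numpy as np

rng = np.random.default_rng(0)

# Toy task: given x in [0,1]^2, predict which coordinate is larger.
# Toy "trained" model: logits = W @ x, with W close to an ideal circuit.
W_ideal = np.array([[1.0, -1.0], [-1.0, 1.0]])
W = W_ideal + 0.01 * rng.standard_normal((2, 2))

# Mechanistic interpretation: the model approximately computes x0 - x1.
# Compact proof: bound the deviation E = W - W_ideal once, then certify
# correctness for every input whose margin exceeds a threshold.
E = W - W_ideal
eps = np.linalg.norm(E, ord=2)  # spectral norm bounds worst-case logit error

# Ideal logit gap is 2*|x0 - x1|; perturbation shifts it by at most
# 2*eps*||x||_2, and ||x||_2 <= sqrt(2) on [0,1]^2. So any input with
# margin |x0 - x1| > eps*sqrt(2) is provably classified correctly.
certified_margin = eps * np.sqrt(2)

# Soundness check on random inputs: the certificate must never mark an
# input as certified-correct that the model actually gets wrong.
X = rng.uniform(0, 1, size=(10_000, 2))
margins = np.abs(X[:, 0] - X[:, 1])
preds = (X @ W.T).argmax(axis=1)
labels = X.argmax(axis=1)

certified = margins > certified_margin
assert (preds[certified] == labels[certified]).all()
print(f"eps={eps:.4f}, certified fraction={certified.mean():.3f}, "
      f"empirical accuracy={(preds == labels).mean():.3f}")
```

Even in this toy form, the sketch shows why such proofs score well on several virtues at once: the certificate is concise (one spectral-norm computation rather than per-input checks), unifying (a single argument covers all certified inputs), and general (the same bound holds for any weights sufficiently close to the ideal circuit).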