🤖 AI Summary
This paper addresses the philosophical foundations and theoretical boundaries of mechanistic interpretability, centering on the question: *How can we construct explanations of neural networks that are causally faithful, ontically grounded, and falsifiable?*
Method: It introduces the first systematic meta-theoretical framework for mechanistic interpretability, integrating philosophy of science, causal modeling, and formal semantics. Mechanistic interpretability is defined as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks. The paper rigorously defines *Explanatory Faithfulness* (how well an explanation fits a model), establishes a four-part normative standard for it, and proposes the Principle of Explanatory Optimism to ground the feasibility of the paradigm.
Contribution: It clarifies the demarcation between mechanistic interpretability and other explanation paradigms; identifies its inherent theoretical limits; and advances the “Explanatory View Hypothesis,” positing that neural networks contain implicit explanations that can be extracted and understood.
📝 Abstract
Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
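To make the abstract's notion of a Causal-Mechanistic, Falsifiable explanation concrete, the sketch below is a minimal illustration in the spirit of intervention-based faithfulness checks; it is not taken from the paper, and the toy network, the `explanation_predicts` surrogate, and the `0.5` tolerance are all illustrative assumptions. A claim about which internal component carries the causally relevant information is tested by patching that component and comparing the model's actual behavior with the explanation's prediction; a large deviation falsifies the claim.

```python
# Hypothetical sketch: testing a causal-mechanistic explanation by intervention.
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": a fixed two-layer network whose hidden unit 0 roughly computes
# AND(x0, x1) and whose hidden unit 1 is near-irrelevant noise.
W1 = np.array([[4.0, 4.0], [rng.normal(), rng.normal()]])
b1 = np.array([-6.0, 0.0])
W2 = np.array([5.0, 0.1])
b2 = -2.5

def hidden(x):
    return np.tanh(W1 @ x + b1)

def model(x, patched_hidden=None):
    h = hidden(x) if patched_hidden is None else patched_hidden
    return float(W2 @ h + b2)

# Candidate mechanistic explanation (the object under test):
# "The output is driven by hidden unit 0 alone, which encodes AND(x0, x1)."
def explanation_predicts(h0):
    return float(W2[0] * h0 + b2)

# Falsifiability check: patch hidden unit 0 with its value from a counterfactual
# input and see whether the model's output moves as the explanation predicts.
inputs = [np.array(p, dtype=float) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
max_err = 0.0
for x in inputs:
    for x_cf in inputs:
        h_patched = hidden(x).copy()
        h_patched[0] = hidden(x_cf)[0]  # intervention on the claimed mechanism
        actual = model(x, patched_hidden=h_patched)
        predicted = explanation_predicts(h_patched[0])
        max_err = max(max_err, abs(actual - predicted))

# A crude faithfulness score: worst-case disagreement between the model under
# intervention and the explanation's prediction (tolerance is an assumption).
print(f"worst-case intervention error: {max_err:.3f}")
print("explanation survives" if max_err < 0.5 else "explanation falsified")
```

The point of the sketch is only that the explanation makes a definite, checkable prediction about the model's behavior under intervention, which is what distinguishes a falsifiable causal-mechanistic claim from a merely descriptive one.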