A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i

📅 2025-05-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the philosophical foundations and theoretical limits of mechanistic interpretability, centering on the question: *how can we construct explanations of neural networks that are causally faithful and verifiable?* Method: it introduces a systematic meta-theoretical framework integrating philosophy of science, causal modeling, and formal semantics. Mechanistic Interpretability (MI) is defined as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks. The paper shows that *Explanatory Faithfulness*, an assessment of how well an explanation fits a model, is well-defined, and formulates the Principle of Explanatory Optimism, a conjecture argued to be a necessary precondition for MI's success. Contribution: it demarcates MI from other interpretability paradigms, details MI's inherent limits, and advances the Explanatory View Hypothesis, which posits that neural networks contain implicit explanations that can be extracted and understood.

📝 Abstract
Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
Problem

Research questions and friction points this paper is trying to address.

Defining Mechanistic Interpretability as causal explanations of neural networks
Establishing Explanatory Faithfulness to assess explanation-model fit
Proposing the Principle of Explanatory Optimism for MI success
Innovation

Methods, ideas, or system contributions that make the work stand out.

Argues that neural networks contain implicit explanations which can be extracted and understood
Defines Explanatory Faithfulness to assess how well an explanation fits a model
Defines Mechanistic Interpretability by four criteria: Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations