Causal Abstraction: A Theoretical Foundation for Mechanistic Interpretability

📅 2023-01-11
📈 Citations: 49
Influential: 3
🤖 AI Summary
This work addresses the lack of theoretical foundations in mechanistic interpretability by proposing a unified formal framework grounded in causal abstraction, designed to construct simplified high-level models that are both faithful to the low-level mechanisms of black-box AI systems and human-interpretable. Methodologically, it generalizes causal abstraction from mechanism replacement to arbitrary mechanism transformations for the first time; rigorously defines key concepts, including polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness; and systematically unifies over a dozen mainstream interpretability techniques (e.g., activation patching, causal tracing, sparse autoencoders). Its core contribution is the first shared theoretical language spanning this broad spectrum of interpretability methods, ensuring that abstracted models stay faithful to the underlying computational mechanisms while substantially improving explanatory consistency and intelligibility.
📝 Abstract
Causal abstraction provides a theoretical foundation for mechanistic interpretability, the field concerned with providing intelligible algorithms that are faithful simplifications of the known, but opaque low-level details of black box AI models. Our contributions are (1) generalizing the theory of causal abstraction from mechanism replacement (i.e., hard and soft interventions) to arbitrary mechanism transformation (i.e., functionals from old mechanisms to new mechanisms), (2) providing a flexible, yet precise formalization for the core concepts of polysemantic neurons, the linear representation hypothesis, modular features, and graded faithfulness, and (3) unifying a variety of mechanistic interpretability methods in the common language of causal abstraction, namely, activation and path patching, causal mediation analysis, causal scrubbing, causal tracing, circuit analysis, concept erasure, sparse autoencoders, differential binary masking, distributed alignment search, and steering.
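Of the methods the abstract unifies, activation patching is the easiest to see concretely. The sketch below is illustrative only (a toy two-layer network with made-up weights, not anything from the paper): an interchange intervention caches one hidden unit's activation from a "source" input and splices it into the forward pass on a "base" input, a hard intervention on that unit's mechanism.

```python
import math

# Toy 2-layer network with fixed illustrative weights (not from the paper).
W1 = [[0.5, -0.3, 0.8],
      [0.1, 0.9, -0.4],
      [-0.7, 0.2, 0.6],
      [0.3, -0.5, 0.1]]
W2 = [[0.4, -0.2, 0.7, 0.1],
      [-0.6, 0.5, 0.3, -0.1]]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(x, patch=None):
    """Run the toy network; patch=(i, v) overwrites hidden unit i with v,
    i.e. a hard intervention replacing that unit's mechanism."""
    h = [math.tanh(z) for z in matvec(W1, x)]
    if patch is not None:
        i, v = patch
        h[i] = v
    return matvec(W2, h)

base, source = [1.0, 0.0, -1.0], [0.0, 1.0, 1.0]

# Interchange intervention (activation patching): cache hidden unit 2's
# activation on the source input, then splice it into the base run.
h_source = [math.tanh(z) for z in matvec(W1, source)]
clean = forward(base)
patched = forward(base, patch=(2, h_source[2]))
print(clean, patched)
```

Comparing `clean` and `patched` outputs is how patching experiments attribute behavior to the intervened component; the causal-abstraction framework recasts this comparison as checking whether a high-level model is an abstraction of the low-level network.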
Problem

Research questions and friction points this paper is trying to address.

Establishing a theoretical foundation for mechanistic interpretability via causal abstraction
Generalizing causal abstraction theory from mechanism replacement to arbitrary mechanism transformations
Unifying diverse mechanistic interpretability methods within a single causal abstraction framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalizes causal abstraction to arbitrary mechanism transformation
Formalizes polysemantic neurons and modular features
Unifies interpretability methods via causal abstraction
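The replacement-versus-transformation distinction at the heart of the generalization can be sketched in a few lines. In the hedged toy model below (all names and functions are illustrative, not from the paper), a hard intervention fixes a variable to a constant, a soft intervention swaps in a new function of its parents, and a mechanism transformation is a functional mapping the old mechanism itself to a new one.

```python
# Toy structural model: a hidden variable h = mech(x), then a fixed
# downstream mechanism. Everything here is illustrative only.
def model(x, mech):
    h = mech(x)           # mechanism computing the hidden variable
    return 2 * h + 1      # downstream mechanism (held fixed)

old_mech = lambda x: x * x

# Hard intervention: replace the mechanism with a constant.
hard = lambda x: 3.0
# Soft intervention: replace it with a new function of its parents.
soft = lambda x: x + 1
# Mechanism transformation: a functional from old mechanism to new mechanism,
# e.g. composing the old mechanism with a rescaling.
transform = lambda mech: (lambda x: 0.5 * mech(x))
new_mech = transform(old_mech)

print(model(3, old_mech), model(3, hard), model(3, soft), model(3, new_mech))
# → 19 7.0 9 10.0
```

Hard and soft interventions discard the old mechanism entirely, whereas a transformation can depend on it; the paper's generalization admits any such functional, which is what lets it cover methods like concept erasure and steering that edit rather than replace mechanisms.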