Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

📅 2026-05-04

🤖 AI Summary
Existing causal abstraction methods support only global evaluation of explanation faithfulness, making it difficult to diagnose where explanations succeed or fail across specific input regions. This work introduces input space partitioning into the causal abstraction framework, enabling fine-grained localization and attribution of explanation efficacy by identifying “well-explained” and “under-explained” regions through single-intervention swaps. The approach not only reveals failure modes of high-level causal hypotheses but also offers recursive reconstruction and compositional strategies to refine explanations. Experiments demonstrate precise error analysis across multiple causal abstraction settings and show that, in toy logical tasks, the method can recover correct high-level hypotheses from scratch, thereby validating its effectiveness in constructing more accurate and scalable mechanistic explanations.
📝 Abstract
We present a method for diagnosing interpretation in neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the structure of the well-interpreted and under-interpreted regions, we can identify missing distinctions in a high-level hypothesis, discover previously unmodeled intermediate variables, and combine complementary partial interpretations into a stronger one. We instantiate this idea as a simple four-step recipe and show that it yields informative error analyses across multiple causal abstraction settings. In a toy logic task, recursively applying the recipe recovers a high-level hypothesis from scratch. More broadly, our results suggest that partitioning the input space is a useful step toward more precise, constructive, and scalable mechanistic interpretability.
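To make the idea of partitioning by pairwise interchange-intervention behavior concrete, here is a minimal sketch on a toy task. Everything in it is an illustrative assumption rather than the paper's recipe: the low-level "model" is a hand-written function whose hidden value is `a XOR b`, the high-level hypothesis claims that value is `a OR b`, and the greedy rule in `partition` is just one simple way to separate inputs. It is not the four-step recipe, which the abstract does not spell out.

```python
from itertools import product

# Hypothetical low-level model: hidden site h = a XOR b, output = h XOR c.
def low_hidden(x):
    a, b, _ = x
    return a ^ b

def low_output(h, x):
    return h ^ x[2]

# Hypothetical high-level hypothesis: variable V = a OR b, output = V XOR c.
def high_var(x):
    a, b, _ = x
    return a | b

def high_output(v, x):
    return v ^ x[2]

def interchange_agrees(base, source):
    """Patch the source's intermediate value into the base run at both
    levels and check whether the two patched outputs match."""
    low = low_output(low_hidden(source), base)
    high = high_output(high_var(source), base)
    return low == high

def iia(inputs):
    """Interchange intervention accuracy over all (base, source) pairs."""
    pairs = [(b, s) for b in inputs for s in inputs]
    return sum(interchange_agrees(b, s) for b, s in pairs) / len(pairs)

def partition(inputs):
    """Greedy split (an assumption, not the paper's procedure): repeatedly
    drop the input involved in the most failing pairs until every remaining
    pair agrees."""
    well, under = list(inputs), []
    while well:
        failures = {
            x: sum((not interchange_agrees(x, y)) + (not interchange_agrees(y, x))
                   for y in well)
            for x in well
        }
        worst = max(well, key=failures.get)
        if failures[worst] == 0:
            break
        well.remove(worst)
        under.append(worst)
    return well, under

inputs = list(product([0, 1], repeat=3))
well, under = partition(inputs)
print("global IIA:", iia(inputs))
print("IIA on well-interpreted region:", iia(well))
print("under-interpreted inputs:", sorted(under))
```

In this toy setup the under-interpreted inputs are exactly those with `a = b = 1`, the only place where XOR and OR differ. That shared feature is the kind of missing distinction the abstract suggests reading off the partition, whereas the global accuracy alone only says that the hypothesis is imperfect.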
Problem

Research questions and friction points this paper is trying to address.

causal abstraction
interpretability
interchange intervention
input space partitioning
faithfulness
Innovation

Methods, ideas, or system contributions that make the work stand out.

causal abstraction
interchange intervention
input space partitioning
mechanistic interpretability
diagnostic interpretability