Bucketing the Good Apples: A Method for Diagnosing and Improving Causal Abstraction

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing causal abstraction methods evaluate explanation faithfulness only globally, which makes it hard to diagnose where an explanation succeeds or fails across the input space. This work adds input space partitioning to the causal abstraction framework: inputs are split into "well-interpreted" and "under-interpreted" regions according to their pairwise interchange-intervention behavior, so that explanation quality can be localized rather than summarized by a single score. The partition reveals failure modes of a high-level causal hypothesis and suggests ways to refine it, including identifying missing distinctions, discovering unmodeled intermediate variables, and combining complementary partial interpretations. The method yields informative error analyses across multiple causal abstraction settings, and in a toy logic task, applying it recursively recovers a high-level hypothesis from scratch. The results suggest that input space partitioning is a useful step toward more precise, constructive, and scalable mechanistic explanations.

📝 Abstract

We present a method for diagnosing interpretations of neural networks by identifying an input subspace where a proposed interpretation is highly faithful. Our method is particularly useful for causal-abstraction-style interpretability, where a high-level causal hypothesis is evaluated by interchange interventions. Rather than treating interchange intervention accuracy as a single global summary, we refine this framework by partitioning the input space into well-interpreted and under-interpreted regions according to pairwise interchange-intervention behavior. This turns causal abstraction from a purely global evaluation into a more diagnostic tool: it not only measures whether an interpretation works, but also reveals where it works, where it fails, and what distinguishes the two cases. This diagnostic view also provides practical heuristics for improving interpretations. By analyzing the structure of the well-interpreted and under-interpreted regions, we can identify missing distinctions in a high-level hypothesis, discover previously unmodeled intermediate variables, and combine complementary partial interpretations into a stronger one. We instantiate this idea as a simple four-step recipe and show that it yields informative error analyses across multiple causal abstraction settings. In a toy logic task, recursively applying the recipe recovers a high-level hypothesis from scratch. More broadly, our results suggest that partitioning the input space is a useful step toward more precise, constructive, and scalable mechanistic interpretability.
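
The abstract does not spell out the four-step recipe, so the following is only an illustrative sketch of the general idea of partitioning inputs by pairwise interchange-intervention agreement. It is not the authors' procedure. The toy model, the high-level hypothesis, the function names (`low_hidden`, `low_output`, `high_var`, `high_output`, `agree`, `partition`), and the greedy removal rule are all assumptions made for this example.

```python
from itertools import product

# Hypothetical toy "network": inputs are bits (a, b, c).
# Its hidden unit computes a AND b AND (NOT c); its output is h OR c,
# so without any intervention the output equals (a AND b) OR c.
def low_hidden(x):
    a, b, c = x
    return a & b & (1 - c)

def low_output(x, h):
    return h | x[2]

# Hypothetical high-level hypothesis: an intermediate variable V = a AND b,
# with output V OR c. It matches the network's input-output behavior,
# but V is not exactly what the hidden unit encodes.
def high_var(x):
    a, b, _ = x
    return a & b

def high_output(x, v):
    return v | x[2]

def agree(base, src):
    """One interchange intervention: patch the hidden unit (and V) from src
    into base, then check whether network and hypothesis give the same output."""
    return low_output(base, low_hidden(src)) == high_output(base, high_var(src))

def partition(inputs):
    """Assumed greedy rule: repeatedly drop the input involved in the most
    disagreeing (base, src) pairs until all remaining pairs agree."""
    well, under = list(inputs), []
    while well:
        fails = {x: 0 for x in well}
        for base in well:
            for src in well:
                if not agree(base, src):
                    fails[base] += 1
                    fails[src] += 1
        worst = max(well, key=lambda x: fails[x])
        if fails[worst] == 0:
            break
        well.remove(worst)
        under.append(worst)
    return well, under

inputs = list(product([0, 1], repeat=3))

# Global interchange intervention accuracy over all ordered pairs.
iia = sum(agree(b, s) for b in inputs for s in inputs) / len(inputs) ** 2

well, under = partition(inputs)
print("global IIA:", iia)
print("well-interpreted:", well)
print("under-interpreted:", under)
```

In this toy setup the global score is 0.9375, which on its own says little about what is wrong. The partition isolates the single input (1, 1, 1) as under-interpreted, pointing at the missing distinction: the hidden unit also depends on `c`, which the hypothesized variable V ignores.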

Problem

Research questions and friction points this paper is trying to address.

causal abstraction

interpretability

interchange intervention

input space partitioning

faithfulness

Innovation

Methods, ideas, or system contributions that make the work stand out.

causal abstraction

interchange intervention

input space partitioning