PLOT: Progressive Localization via Optimal Transport in Neural Causal Abstraction

📅 2026-05-07

🤖 AI Summary
Efficiently identifying the low-level neural units that correspond to high-level causal variables, without resorting to costly exhaustive search, is a central challenge in neural causal abstraction. This work uses optimal transport to establish soft correspondences between high-level abstract variables and candidate neural locations, and proposes a coarse-to-fine progressive localization strategy. The resulting method can either quickly produce intervention handles with competitive accuracy on its own, or guide Distributed Alignment Search (DAS) to reach accuracy comparable to full DAS at a substantially lower computational cost.
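The idea of a soft correspondence can be illustrated with entropic optimal transport. This is a minimal sketch under stated assumptions, not PLOT's implementation. Each abstract variable and each candidate site is summarized by a hypothetical "effect signature" vector (for example, how intervening on it shifts the model's outputs over a fixed batch of inputs). The cost is one minus cosine similarity, and the coupling is computed with balanced Sinkhorn iterations under uniform marginals. The helper names `effect_cost` and `sinkhorn`, the regularization `eps`, and the toy signatures are illustrative choices.

```python
import math

def effect_cost(abstract_effects, site_effects):
    """C[k][j] = 1 - cosine similarity between the effect signature of
    abstract variable k and that of candidate neural site j (assumed cost)."""
    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm > 0 else 0.0
    return [[1.0 - cosine(a, s) for s in site_effects] for a in abstract_effects]

def sinkhorn(cost, eps=0.05, iters=500):
    """Entropic OT coupling between rows (abstract variables) and columns
    (candidate sites), with uniform marginals on both sides."""
    K, J = len(cost), len(cost[0])
    G = [[math.exp(-cost[k][j] / eps) for j in range(J)] for k in range(K)]  # Gibbs kernel
    u, v = [1.0] * K, [1.0] * J
    for _ in range(iters):
        # Alternately rescale rows and columns to match the target marginals
        u = [(1.0 / K) / sum(G[k][j] * v[j] for j in range(J)) for k in range(K)]
        v = [(1.0 / J) / sum(G[k][j] * u[k] for k in range(K)) for j in range(J)]
    return [[u[k] * G[k][j] * v[j] for j in range(J)] for k in range(K)]

# Toy example: two abstract variables, four candidate neurons
abstract = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sites = [[0.9, 0.1, 0.0], [0.1, 0.95, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
P = sinkhorn(effect_cost(abstract, sites))
best = [max(range(len(row)), key=row.__getitem__) for row in P]
print(best)  # → [0, 1]
```

Taking the largest entry in each row turns the soft coupling into a hard site assignment. With balanced marginals, sites that match neither variable well (here sites 2 and 3) still absorb mass, which is one reason a calibration step is needed before a coupling can serve as an intervention handle.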
📝 Abstract
Causal abstraction offers a principled framework for mechanistic interpretability, aligning a high-level causal model with the low-level computation realized by a neural network through counterfactual intervention analysis. Existing methods such as distributed alignment search (DAS) learn expressive subspace interventions, but the relevant neural site is unknown a priori, so finding a handle requires a computationally burdensome search over candidate sites. We introduce PLOT (Progressive Localization via Optimal Transport), a transport-based framework that localizes causal variables from the output effect geometry of abstract and neural interventions. PLOT fits an optimal transport coupling between abstract variables and candidate neural sites, yielding a global soft correspondence that can be calibrated into intervention handles. In simple settings, a single coupling over individual neurons suffices. In larger models, PLOT is applied progressively, moving from coarse sites such as tokens, timesteps, or layers to finer supports such as coordinate groups or PCA spans, and optionally guiding DAS based on the localized signal. Across experiments of increasing complexity, transport-only PLOT handles are exceedingly fast and competitive on accuracy, while PLOT-guided DAS reaches DAS-level accuracy at a fraction of full DAS runtime, providing an efficient localization engine for causal abstraction research at scale.
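The progressive variant can be pictured as two rounds of the same kind of coupling. This is a hypothetical illustration, not the paper's procedure. The first round couples variables to coarse sites (here, toy "layers" L0 to L2) and keeps each variable's highest-mass site. The second round couples variables to the finer supports (toy coordinate groups) nested inside the kept sites. The calibration rule, which keeps every support whose coupling mass is at least `tau` times the row maximum, is an assumption made for this sketch. `localize_progressive`, `tau`, and the effect signatures are all hypothetical, and the optional DAS stage is not shown.

```python
import math

def cos_cost(u, v):
    # 1 - cosine similarity between two effect signatures (assumed cost)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - (dot / norm if norm > 0 else 0.0)

def sinkhorn(cost, eps=0.05, iters=500):
    # Balanced entropic OT with uniform marginals (an assumption of this sketch)
    K, J = len(cost), len(cost[0])
    G = [[math.exp(-cost[k][j] / eps) for j in range(J)] for k in range(K)]
    u, v = [1.0] * K, [1.0] * J
    for _ in range(iters):
        u = [(1.0 / K) / sum(G[k][j] * v[j] for j in range(J)) for k in range(K)]
        v = [(1.0 / J) / sum(G[k][j] * u[k] for k in range(K)) for j in range(J)]
    return [[u[k] * G[k][j] * v[j] for j in range(J)] for k in range(K)]

def couple(abstract_effects, site_effects):
    # Couple every abstract variable to every site in a {name: signature} dict
    names = list(site_effects)
    cost = [[cos_cost(a, site_effects[n]) for n in names] for a in abstract_effects]
    return names, sinkhorn(cost)

def localize_progressive(abstract_effects, coarse_effects, fine_effects, tau=0.75):
    """Hypothetical two-stage (coarse-to-fine) localization.

    coarse_effects: {coarse_site: signature}, e.g. one entry per layer
    fine_effects:   {coarse_site: {fine_support: signature}}
    Returns {variable index: [(coarse_site, fine_support), ...]}.
    """
    # Stage 1: keep the coarse site with the most coupling mass for each variable
    names, P = couple(abstract_effects, coarse_effects)
    kept = sorted({names[max(range(len(row)), key=row.__getitem__)] for row in P})
    # Stage 2: couple again, restricted to fine supports inside the kept sites
    pool = {(c, f): sig for c in kept for f, sig in fine_effects[c].items()}
    fine_names, Q = couple(abstract_effects, pool)
    # Calibrate: a handle keeps supports with mass >= tau * the row maximum
    return {
        k: [fine_names[j] for j, m in enumerate(row) if m >= tau * max(row)]
        for k, row in enumerate(Q)
    }

abstract = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
coarse = {"L0": [0.0, 0.0, 1.0], "L1": [1.0, 0.1, 0.0], "L2": [0.1, 1.0, 0.0]}
fine = {
    "L0": {"g0": [0.0, 0.0, 1.0]},
    "L1": {"g0": [1.0, 0.0, 0.0], "g1": [0.0, 0.0, 1.0]},
    "L2": {"g0": [0.0, 1.0, 0.0], "g1": [0.0, 0.0, 1.0]},
}
print(localize_progressive(abstract, coarse, fine))
# → {0: [('L1', 'g0')], 1: [('L2', 'g0')]}
```

The relative threshold matters because balanced transport spreads leftover mass over irrelevant supports. Here each `g1` group receives about 0.125 per variable. A larger `tau` discards those groups, while a smaller one would widen the handle to include them.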
Problem

Research questions and friction points this paper is trying to address.

causal abstraction
intervention localization
optimal transport
mechanistic interpretability
neural network
Innovation

Methods, ideas, or system contributions that make the work stand out.

optimal transport
causal abstraction
progressive localization
mechanistic interpretability
intervention discovery