Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

2026-04-18
Citations: 0
Influential: 0

AI Summary
Existing feature-interpretation pipelines incur substantial explanation and evaluation overhead because uniform sampling includes many cross-layer transcoder (CLT) features that are irrelevant to the target behavior. This work proposes PIE, described as the first CLT-native end-to-end interpretability pipeline, which connects pruning, automated explanation, and evaluation of those explanations. Central to PIE are two new mechanisms: Feature Attribution Patching (FAP), which scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware re-ranking strategy. Evaluated on the IOI and Doc-String tasks, FAP-based methods achieve the best or near-best behavioral fidelity, with FAP-Synergy showing its clearest gains under strict budgets. On IOI, pruning to K=100 features matches the KL fidelity that random selection needs approximately 4,000 features to reach, a ~40× compression that cuts interpretation and evaluation calls by a similar factor.

Abstract
Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior; the rest add substantial feature-explanation and evaluation costs. We introduce PIE, the first CLT-native end-to-end framework, which connects Pruning, automatic Interpretation, and interpretation Evaluation, enabling systematic measurement of behavioral fidelity and downstream interpretability under pruning. To achieve this, we propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics. Across IOI and Doc-String, across budgets $K \in \{50, 100, 200, 400, 800\}$, and across FAP, FAP-Synergy, Activation-Magnitude, and ACDC-style pruning, the FAP family consistently achieves the best or near-best fidelity, with FAP-Synergy providing its clearest gains in strict-budget regimes. On IOI with CLTs for Llama-3.2-1B and Gemma-2-2B, pruning to $K=100$ features matches the KL fidelity that random selection from the active feature set requires $\approx 4$k features to achieve ($\approx 40\times$ compression), enabling $\approx 40\times$ fewer interpretation/evaluation calls while substantially reducing low-quality features.
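
To make the idea of gradient-weighted pruning concrete, the following is a minimal toy sketch, not the paper's FAP implementation. It assumes a purely linear readout in which each feature adds `activation * write_vector` to the output logits, and it scores each feature by a first-order, attribution-patching-style estimate of how much ablating it would change a logit-difference metric. All names, the toy numbers, and the linear setup are illustrative assumptions; the abstract does not specify how write contributions are aggregated across layers.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions with q > 0 everywhere.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def logits_from(acts, writes, keep=None):
    # Toy linear readout: sum of activation * write vector over kept features.
    out = [0.0] * len(writes[0])
    for f, (a, w) in enumerate(zip(acts, writes)):
        if keep is None or f in keep:
            for j, wj in enumerate(w):
                out[j] += a * wj
    return out

def attribution_scores(acts, writes, grad):
    # First-order ablation estimate: |activation * (grad . write vector)|.
    return [abs(a * sum(g * wj for g, wj in zip(grad, w)))
            for a, w in zip(acts, writes)]

def top_k(values, k):
    # Indices of the k largest values.
    order = sorted(range(len(values)), key=lambda f: values[f], reverse=True)
    return set(order[:k])

# Hypothetical toy data: 5 features writing to 3 output logits.
acts = [2.0, 0.1, 1.5, 0.05, 3.0]
writes = [[1.0, -1.0, 0.0], [0.2, 0.1, 0.0], [0.5, -0.5, 0.1],
          [0.0, 0.3, 0.2], [0.1, 0.1, 0.1]]
grad = [1.0, -1.0, 0.0]  # gradient of (logit 0 - logit 1) w.r.t. the logits

full = softmax(logits_from(acts, writes))
scores = attribution_scores(acts, writes, grad)

kept = top_k(scores, 2)                # attribution-based pruning
kept_actmag = top_k(acts, 2)           # activation-magnitude baseline

kl_pruned = kl_divergence(full, softmax(logits_from(acts, writes, kept)))
kl_actmag = kl_divergence(full, softmax(logits_from(acts, writes, kept_actmag)))
print(sorted(kept), kl_pruned, sorted(kept_actmag), kl_actmag)
```

In this toy setup the feature with the largest activation (index 4) writes equally to every logit, so its attribution score is zero and it is dropped, whereas an activation-magnitude ranking would keep it. The KL divergence between the full and pruned output distributions plays the role of the behavior-retention measure; in a real CLT the gradient would come from backpropagation through the model rather than a hand-written vector.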
Problem

Research questions and friction points this paper is trying to address.

feature attribution
cross-layer transcoder
circuit discovery
interpretability
pruning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Transcoder
Feature Attribution Patching
FAP-Synergy
Interpretability Pruning
Circuit Discovery