Prune, Interpret, Evaluate: A Cross-Layer Transcoder-Native Framework for Efficient Circuit Discovery via Feature Attribution

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing feature-interpretation pipelines incur substantial explanation and evaluation overhead because they operate on uniformly sampled units, most of which are cross-layer transcoder (CLT) features irrelevant to the target behavior. This work proposes PIE, described as the first CLT-native end-to-end interpretability framework, which connects pruning, automated explanation, and explanation evaluation. Central to PIE are two new mechanisms: Feature Attribution Patching (FAP), a patch-grounded attribution method, and FAP-Synergy, a synergy-aware re-ranking strategy. Evaluated on the IOI and Doc-String tasks across budgets from K=50 to K=800, the FAP family achieves the best or near-best behavioral fidelity, with FAP-Synergy helping most under strict budgets. On IOI, pruning to K=100 features matches the KL fidelity that random selection from the active feature set needs approximately 4,000 features to reach, a ~40× compression that cuts interpretation and evaluation calls by about the same factor.

📝 Abstract

Existing feature-interpretation pipelines typically operate on uniformly sampled units, but only a small fraction of cross-layer transcoder (CLT) features matter for a target behavior; the rest only inflate the cost of explaining and evaluating features. We introduce PIE, the first CLT-native end-to-end framework connecting Pruning, automatic Interpretation, and interpretation Evaluation, which enables systematic measurement of behavioral fidelity and downstream interpretability under pruning. To achieve this, we propose Feature Attribution Patching (FAP), a patch-grounded attribution method that scores CLT features by aggregating gradient-weighted write contributions, and FAP-Synergy, a synergy-aware reranking procedure. We evaluate pruning using KL-divergence behavior retention and assess interpretation quality with FADE-style metrics. Across IOI and Doc-String, across budgets $K \in \{50, 100, 200, 400, 800\}$, and across FAP, FAP-Synergy, Activation-Magnitude, and ACDC-style pruning, the FAP family consistently achieves the best or near-best fidelity, with FAP-Synergy providing its clearest gains in strict-budget regimes. On IOI with CLTs for Llama-3.2-1B and Gemma-2-2B, pruning to $K=100$ features matches the KL fidelity that random selection from the active feature set requires $\approx 4$k features to achieve ($\approx 40\times$ compression), enabling $\approx 40\times$ fewer interpretation/evaluation calls while substantially reducing low-quality features.
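To make the scoring-and-pruning idea concrete, here is a minimal NumPy sketch of one plausible reading of "aggregating gradient-weighted write contributions". It is not the paper's implementation: the abstract does not give the FAP formula, so the array names (`act_delta`, `w_dec`, `grad_resid`), their shapes, and the single-position simplification are all assumptions made for illustration. Each feature's score is the absolute first-order effect of its clean-minus-corrupted activation, summed over every layer its decoder writes into; the top K features are kept, and fidelity is measured as the KL divergence between the full and pruned models' output distributions.

```python
import numpy as np

def write_attribution_scores(act_delta, w_dec, grad_resid):
    """Hypothetical FAP-style score per feature (assumed form, not the paper's).

    act_delta:  (F,)      clean-minus-corrupted activation of each feature
    w_dec:      (L, F, D) decoder vector each feature writes into each layer
    grad_resid: (L, D)    gradient of the task metric w.r.t. each layer's residual
    """
    # Gradient-weighted write of every feature, summed over all target layers.
    write_effect = np.einsum("lfd,ld->f", w_dec, grad_resid)
    return np.abs(act_delta * write_effect)

def top_k_features(scores, k):
    """Indices of the k highest-scoring features, highest first."""
    k = min(k, scores.size)
    return np.argsort(-scores, kind="stable")[:k]

def kl_divergence(logits_full, logits_pruned):
    """KL(full || pruned) over the last axis, averaged over any leading rows."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)  # stabilise the exponentials
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    lp = log_softmax(np.asarray(logits_full, dtype=float))
    lq = log_softmax(np.asarray(logits_pruned, dtype=float))
    return float((np.exp(lp) * (lp - lq)).sum(axis=-1).mean())

def compression_ratio(n_random_features, k):
    """How many times fewer features (and explanation calls) the pruned set needs."""
    return n_random_features / k
```

With the figures reported above, `compression_ratio(4000, 100)` gives the stated 40× reduction. A synergy-aware reranking step such as FAP-Synergy would operate on the candidate set returned by `top_k_features`, but the abstract does not describe it in enough detail to sketch here.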

Problem

Research questions and friction points this paper is trying to address.

feature attribution

cross-layer transcoder

circuit discovery

interpretability

pruning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Layer Transcoder

Feature Attribution Patching

FAP-Synergy